Justin,

> Here at NCEP we are running ldm-6.7.0 on our IBM P6 cluster supercomputer
> (running AIX 5.2). We have been running this version of LDM for over a year
> with no trouble, but this morning our ldm processes began to fail.
>
> First all of our requests and downstream feeds stopped:
>
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO: 34884 20110804152641.591 NEXRAD2 124004 L2-BZIP2/KTBW/20110804152630/124/4/I/V03/0
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO: file: -close /dcom/us007003/ldmdata/obs/upperair/nexrad_level2/KTBW/KTBW_20110804_152630.bz2
> Aug 04 15:26:42 c1n5 local0:info pqing[483668] INFO: 14867 20110804152637.206 FNEXRAD 023 SXUS72 KWBC 041524
> Aug 04 15:26:45 c1n5 local0:info pqing[459286] INFO: 1358 20110804152640.008 HDS 617 SDUS41 KGYX 041517 /pRCMGYX
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Connection from c3n3p.ncep.noaa.gov
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Exiting
> Aug 04 15:27:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 913510 exited with status 0
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Connection from c3n3p.ncep.noaa.gov
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Exiting
> Aug 04 15:28:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 918520 exited with status 0
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Connection from c3n3p.ncep.noaa.gov
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Exiting
> Aug 04 15:29:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 278546 exited with status 0

The connection from host "c3n3p" and the termination of the associated upstream LDM are odd, but they don't indicate a problem with the LDM on host "c1n5". I suspect that the TCP connection is being destroyed by the network (by a firewall rule, for example).

> Aug 04 15:29:42 c1n5 local0:info 140.90.85.102[578372] INFO: 25548 20110804152641.639 NEXRAD2 697049 L2-BZIP2/KOHX/20110804152305/697/49/I/V03/0
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: readtcp(): EOF on socket 4
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: one_svc_run(): RPC layer closed connection
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: Disconnecting due to LDM failure; Connection to upstream LDM closed
> Aug 04 15:29:42 c1n5 local0:notice 140.90.85.102[578372] NOTE: LDM-6 desired product-class: 20110804142942.773 TS_ENDT {{NEXRAD2, ".*"},{NONE, "SIG=1ad0d1658b39396d3715d4eafd4f9f9e"}}
> ...

The termination of downstream LDM process 578372 was due to a closed TCP connection to upstream host 140.90.85.102. Again, I suspect the network. Because the downstream LDM process will repeatedly try to reestablish the connection to the upstream LDM, this temporary disconnection shouldn't matter, provided the underlying cause was transitory.
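The "readtcp(): EOF on socket 4" and "RPC layer closed connection" errors above are just the LDM's way of reporting that the far end of the TCP connection went away. Below is a minimal C sketch, not the LDM's actual readtcp() routine, of how a blocking read distinguishes a closed connection from a genuine I/O error; the function name read_peer and the messages are illustrative only.

    /*
     * Hedged sketch only -- not the LDM's readtcp().  It shows the general
     * pattern by which a blocking read on a TCP socket reports that the
     * peer (or an intervening firewall) closed the connection: read(2)
     * returns 0 for an orderly close and -1/ECONNRESET for a reset.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns >0 bytes read, 0 on end-of-file (connection closed), -1 on error. */
    ssize_t read_peer(int sock, void *buf, size_t len)
    {
        ssize_t n = read(sock, buf, len);

        if (n == 0) {
            /* The remote end (or something in between) closed the connection:
             * the condition logged above as "EOF on socket N". */
            fprintf(stderr, "EOF on socket %d: connection closed by peer\n", sock);
        }
        else if (n < 0) {
            if (errno == ECONNRESET)
                fprintf(stderr, "socket %d: connection reset by peer\n", sock);
            else
                perror("read");
        }
        return n;
    }

Whatever closes the connection (the upstream host itself, or network equipment in between), the downstream LDM sees the same end-of-file condition, logs the disconnection, and retries, which is why a transient interruption normally heals itself.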
> Then our two pqact processes exited:
>
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Behind by 17.3867 s
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Behind by 17.4668 s
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 237704 exited with status 0: pqact -f CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.craft
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 790878 exited with status 0: pqact -f ANY-CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.conf

This is very odd. The only ways a pqact(1) process can terminate with a status of 0 are to have reached the end of the queue after being started with the "-i 0" option (unlikely) or to have received a TERM signal, which could only have been sent by the LDM user or the superuser. (A minimal sketch of this signal-to-exit-status pattern follows the ticket details below.)

> Our monitoring began to flag indicating the pqact had stopped running. We
> logged in and attempted to stop ldm by hand, but our cron entries to restart
> LDM were already trying to restart the software so we ended up giving the LDM
> processes a kill -9, remade the queue, and restarted the LDM. Everything has
> been running fine since.
>
> Around the time of the failure we know a 'more' was done to the ldm.pq file
> by a userid that does not have write permission to the ldm.pq file. Could
> viewing the ldm.pq file cause queue corruption or some other situation which
> explains why our LDM began to fail?

I'm afraid that I don't see how a read-only more(1) of the product-queue could cause a pqact(1) process to terminate, much less with a status of 0. This assumes, of course, that the operating system behaves correctly.

> We run this version of ldm on both of our supercomputers, both are receiving
> an independent feed of the same observational data, however the failure only
> occurred on the system where the more was performed. All of the evidence is
> pointing to that being the cause, but before we pin the failure on it we just
> wanted to run this by you.

Good idea. Always glad to help.

> Thanks,
>
> Justin Cooke
> NCEP Central Operations

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: XND-654237
Department: Support LDM
Priority: Normal
Status: Closed
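As referenced above regarding the pqact exit status, here is a minimal C sketch, assuming the usual daemon signal-handling pattern rather than pqact(1)'s actual source, of why a process that is sent a TERM signal can still be reported by its parent as having "exited with status 0": the handler merely requests an orderly shutdown and the process then returns 0 from main().

    /*
     * Hedged sketch only -- not pqact(1)'s actual source.  A SIGTERM handler
     * sets a flag, the main loop finishes cleanly, and the process returns 0;
     * the parent therefore logs "exited with status 0" even though the
     * shutdown was triggered by a signal.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t done = 0;

    static void handle_term(int sig)
    {
        (void)sig;
        done = 1;                       /* request an orderly shutdown */
    }

    int main(void)
    {
        struct sigaction act;
        act.sa_handler = handle_term;
        sigemptyset(&act.sa_mask);
        act.sa_flags = 0;
        sigaction(SIGTERM, &act, NULL);

        while (!done) {
            /* In pqact this would be the product-processing loop; here we
             * simply wait until a signal arrives. */
            pause();
        }

        fprintf(stderr, "Exiting\n");   /* compare pqact's "NOTE: Exiting" */
        return 0;                       /* parent sees "exited with status 0" */
    }

A process killed outright by an unhandled signal would instead be reported by its parent as terminated by that signal, so the clean status of 0 points to a deliberate TERM (or the unlikely "-i 0" end-of-queue case) rather than a crash.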