[LDM #XND-654237]: Can viewing ldm.pq cause queue corruption?
- Subject: [LDM #XND-654237]: Can viewing ldm.pq cause queue corruption?
- Date: Thu, 04 Aug 2011 11:44:57 -0600
Justin,
> Here at NCEP we are running ldm-6.7.0 on our IBM P6 cluster supercomputer
> (running AIX 5.2). We have been running this version of LDM for over a year
> with no trouble, but this morning our ldm processes began to fail.
>
> First, all of our requests and downstream feeds stopped:
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO: 34884
> 20110804152641.591 NEXRAD2 124004 L2-BZIP2/KTBW/20110804152630/124/4/I/V03/0
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO: file:
> -close
> /dcom/us007003/ldmdata/obs/upperair/nexrad_level2/KTBW/KTBW_20110804_152630.bz2
> Aug 04 15:26:42 c1n5 local0:info pqing[483668] INFO: 14867
> 20110804152637.206 FNEXRAD 023 SXUS72 KWBC 041524
> Aug 04 15:26:45 c1n5 local0:info pqing[459286] INFO: 1358
> 20110804152640.008 HDS 617 SDUS41 KGYX 041517 /pRCMGYX
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Connection
> from c3n3p.ncep.noaa.gov
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Exiting
> Aug 04 15:27:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 913510 exited
> with status 0
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Connection
> from c3n3p.ncep.noaa.gov
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Exiting
> Aug 04 15:28:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 918520 exited
> with status 0
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Connection
> from c3n3p.ncep.noaa.gov
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Exiting
> Aug 04 15:29:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 278546 exited
> with status 0
The connection from host "c3n3p" and termination of the associated upstream LDM
are odd but don't indicate a problem with the LDM on host "c1n5". I suspect
that the TCP connection is being destroyed somewhere in the network, for
example, by a firewall rule.
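If you want to rule out the network, a quick TCP probe between the two hosts
may help. Below is a minimal Python sketch (not an LDM utility); the host name
is taken from your log and the port is the standard LDM port, 388 -- adjust
both for the direction you want to test.

    #!/usr/bin/env python
    # Minimal TCP probe -- a sketch for ruling out firewall/network trouble,
    # not an LDM tool.  HOST is taken from the log above; PORT is the
    # standard LDM port.  Run it from the host at the other end of the
    # suspect connection.
    import socket
    import sys

    HOST = "c3n3p.ncep.noaa.gov"   # peer host from the log; change as needed
    PORT = 388                     # standard LDM port

    try:
        sock = socket.create_connection((HOST, PORT), timeout=10)
    except OSError as err:
        sys.exit("cannot connect to %s:%d: %s" % (HOST, PORT, err))

    print("connected to %s:%d" % (HOST, PORT))
    sock.close()

If the probe connects but LDM connections still drop after a minute or two,
that points at something stateful (e.g., a firewall timing out idle
connections) rather than basic reachability.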
> Aug 04 15:29:42 c1n5 local0:info 140.90.85.102[578372] INFO: 25548
> 20110804152641.639 NEXRAD2 697049 L2-BZIP2/KOHX/20110804152305/697/49/I/V03/0
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: readtcp():
> EOF on socket 4
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR:
> one_svc_run(): RPC layer closed connection
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR:
> Disconnecting due to LDM failure; Connection to upstream LDM closed
> Aug 04 15:29:42 c1n5 local0:notice 140.90.85.102[578372] NOTE: LDM-6 desired
> product-class: 20110804142942.773 TS_ENDT {{NEXRAD2, ".*"},{NONE,
> "SIG=1ad0d1658b39396d3715d4eafd4f9f9e"}}
> ...
The termination of the downstream LDM process 578372 was due to a closed TCP
connection to upstream host 140.90.85.102. Again, I suspect the network.
Because the downstream LDM process will repeatedly try to reestablish the
connection to the upstream LDM, this temporary disconnection shouldn't matter,
provided the underlying cause was transitory.
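For what it's worth, the retry behavior amounts to little more than the loop
sketched below. This is purely schematic Python, not LDM code;
receive_products() is a hypothetical stand-in for the RPC product transfer.

    # Schematic of the reconnect behavior described above -- not LDM code.
    # A downstream LDM that loses its upstream connection simply waits and
    # tries again until the feed resumes.
    import socket
    import time

    def receive_products(conn):
        """Hypothetical stand-in for the RPC product-transfer loop."""
        while conn.recv(8192):
            pass                            # a real LDM decodes and inserts products

    def keep_feed_alive(host, port=388, retry_interval=30):
        while True:
            try:
                conn = socket.create_connection((host, port), timeout=10)
            except OSError:
                time.sleep(retry_interval)  # transient outage: wait and retry
                continue
            try:
                receive_products(conn)      # runs until EOF or an error
            except OSError:
                pass
            conn.close()
            time.sleep(retry_interval)      # connection dropped: try again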
> Then our two pqact processes exited:
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Behind by 17.3867 s
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Behind by 17.4668 s
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 237704 exited
> with status 0: pqact -f CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.craft
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 790878 exited
> with status 0: pqact -f ANY-CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.conf
This is very odd. The only ways a pqact(1) process can terminate with a status
of 0 are to have reached the end of the queue after being started with the "-i
0" option (unlikely) or to have received a TERM signal, which could only have
been sent by the LDM user or the superuser.
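To illustrate the second possibility: a process that traps the TERM signal and
exits cleanly reports status 0 to its parent, which is what the rpc.ldmd log
lines show. The following is a generic Python sketch of that mechanism, not
LDM code.

    # Generic illustration: a child that catches SIGTERM and exits cleanly
    # shows up in its parent with exit status 0 -- the same status that
    # rpc.ldmd logged for the pqact processes.  Not LDM code.
    import os
    import signal
    import sys
    import time

    pid = os.fork()
    if pid == 0:                              # child: trap TERM, then idle
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
        while True:
            time.sleep(1)
    else:                                     # parent: send TERM, reap the child
        time.sleep(0.5)
        os.kill(pid, signal.SIGTERM)
        _, status = os.waitpid(pid, 0)
        print("child exited with status", os.WEXITSTATUS(status))  # prints 0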
> Our monitoring began to flag that pqact had stopped running. We logged in
> and attempted to stop the LDM by hand, but our cron entries to restart the
> LDM were already trying to restart the software, so we ended up giving the
> LDM processes a kill -9, remade the queue, and restarted the LDM. Everything
> has been running fine since.
>
> Around the time of the failure we know a 'more' was done to the ldm.pq file
> by a userid that does not have write permission to it. Could viewing the
> ldm.pq file cause queue corruption or some other situation that would explain
> why our LDM began to fail?
I'm afraid that I don't see how a read-only more(1) of the product-queue could
cause a pqact(1) process to terminate, much less with a status of 0. This
assumes, of course, that the operating system behaves correctly.
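As a generic illustration (not how more(1) or the LDM actually accesses the
queue), a file that is opened or mapped read-only simply can't be modified
through that view; attempts to write are refused. A short Python sketch
against a hypothetical scratch file:

    # Generic illustration that read-only access can't alter a file.
    # Uses a hypothetical scratch file, not the real ldm.pq, and is not
    # how more(1) or the LDM actually accesses the queue.
    import mmap

    PATH = "/tmp/scratch.pq"                  # hypothetical stand-in for ldm.pq

    with open(PATH, "wb") as f:               # create a small scratch file
        f.write(b"product-queue bytes")

    with open(PATH, "rb") as f:               # open read-only, as a viewer would
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(view[:13])                      # reading works fine
        try:
            view[0] = 0                       # any attempt to write ...
        except TypeError as err:
            print("write refused:", err)      # ... is refused by the read-only mapping
        view.close()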
> We run this version of the LDM on both of our supercomputers; both receive
> an independent feed of the same observational data, yet the failure only
> occurred on the system where the 'more' was performed. All of the evidence
> points to that being the cause, but before we pin the failure on it we just
> wanted to run this by you.
Good idea. Always glad to help.
> Thanks,
>
> Justin Cooke
> NCEP Central Operations
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: XND-654237
Department: Support LDM
Priority: Normal
Status: Closed