Re: "pbuf_flush: time elapsed" problem
- Subject: Re: "pbuf_flush: time elapsed" problem
- Date: Wed, 19 Oct 2005 08:58:05 -0600
Justin,
>Date: Wed, 19 Oct 2005 07:44:56 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/NWS/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem
The above message contained the following:
[snip]
> Yes, I'm talking about the upstream LDM process.
[snip]
> > Would you be willing to modify the LDM source-code and then rebuild and
> > reinstall it with debugging and assertions enabled?
>
> Yes we would
Good. I'll let you know what to do.
> >> Something else that may be of interest, we noticed that after the feed
> >> stopped there was a defunct process with the PPID listed as the PID of
> >> our NEXRAD2 feed (output from ps -ef for the PID 1228948):
> >>
> >> dbndev 532636 1228948 0 0:00 <defunct>
> >> dbndev 1228948 1028176 0 Oct 13 - 24:21 rpc.ldmd -v -q
> >> /usr/ldm/data/ldm.pq /usr/ldm/etc/ldmd.conf
> >>
> >> Any ideas?
> >
> > This is extremely puzzling because upstream LDM processes don't call
> > fork(2) -- so they can't have child processes.
> >
> > grep(1) the LDM logfiles to verify that the PID is that of an upstream
> > LDM, e.g.
> >
> > fgrep '[1228948]' `ls -rt logs/ldmd.log*`
> >
>
> Here is some output from the grep:
>
> Oct 18 14:15:01 b2n1 140.90.85.102[1228948] ERROR: Terminating due to LDM
> failure; Connection to upstream LDM closed
> Oct 18 14:15:01 b2n1 140.90.85.102[1228948] NOTE: LDM-6 desired
> product-class: 20051018141401.214 TS_ENDT {{NEXRAD2, ".*"},{NONE,
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}}
> Oct 18 14:16:17 b2n1 140.90.85.102[1228948] ERROR: Terminating due to LDM
> failure; Couldn't connect to LDM on 140.90.85.102 using either port 388 or
> portmapper; : RPC: Remote system error - A remote host did not respond within
> the timeout period.
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: LDM-6 desired
> product-class: 20051018141401.214 TS_ENDT {{NEXRAD2, ".*"},{NONE,
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}}
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: Product reclassification by
> upstream LDM: 20051018141401.214 TS_ENDT {{NEXRAD2, ".*"},{NONE,
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}} -> 20051018141401.214 TS_ENDT
> {{NEXRAD2, ".*"}}
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: Upstream LDM-6 on
> 140.90.85.102 is willing to be a primary feeder
> Oct 18 14:54:28 b2n1 140.90.85.102[1228948] NOTE: Going verbose
> Oct 18 14:54:29 b2n1 140.90.85.102[1228948] INFO: 9699 20051018145340.836
> NEXRAD2 382027 L2-BZIP2/KBMX/20051018145001/382/27
[snip]
The above messages indicate, conclusively, that process 1228948 was
a downstream LDM and not an upstream LDM. This is equally puzzling
because downstream LDMs don't call fork() either -- and so can't have
child processes.
More relevant, however, is your statement that process 1228948 was an
upstream LDM when it clearly wasn't. Would you please explain this
discrepancy?
> The LDM system that feeds us is restarted twice a day; that's why there
> is a connection failure at ~14:15. At 14:54 I sent the 1228948 process a
> USR2 to put it into verbose mode, and once data stopped being received by
> the upstream LDM we attached truss.
>
> Again, this only seems to happen when the upstream LDM is in verbose
> mode. This process ran for 5 days in silent mode with no problems but
> stopped after 3 hours once it was put into verbose mode.
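For reference, the USR2 step described above can be sketched with a minimal stand-in process (hypothetical; this is not the LDM code itself, merely a process that reports "Going verbose" on SIGUSR2, the way the log excerpt shows rpc.ldmd doing after it was signalled):

```shell
# Minimal stand-in for `kill -USR2 1228948`: a background shell that
# prints "Going verbose" when it receives SIGUSR2.
sh -c 'trap "echo Going verbose" USR2; sleep 5 & wait; wait' &
pid=$!
sleep 1            # give the background shell time to install the trap
kill -USR2 "$pid"  # analogous to signalling the LDM process
wait "$pid"        # background shell prints: Going verbose
```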
Hmm... That information might help. I'll need to know, however, whether
to look at the upstream or downstream LDM code.
> Thanks for continuing to look at this,
Thank you for bringing this up and continuing to work with me.
> Justin
Regards,
Steve Emmerson