[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: "pbuf_flush: time elapsed" problem
- Subject: Re: "pbuf_flush: time elapsed" problem
- Date: Tue, 18 Oct 2005 15:06:18 -0600
Justin,
>Date: Tue, 18 Oct 2005 16:38:00 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/NWS/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem
The above message contained the following:
> Our NEXRAD2 feed stopped and we had our sys admins execute truss on the
> rpc.ldmd process; below is the log for ~25 mins.
I assume that when you say "our NEXRAD2 feed" that you mean the UPSTREAM
LDM process that's sending NEXRAD2 data-products. Please tell me if
this assumption is incorrect.
> ***While truss was attached to the "hung" NEXRAD2 feed we executed a
> kill -USR2 on the rpc.ldmd process, you can see where it caught it but
> returned to the sleeping state.
> ----
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> thread_waitlock_local(0x0000000000000000) = 1
> Received signal #31, SIGUSR2 [caught]
> _getpid() = 1228948
> send(2, 0x0FFFFFFFFFFF6C58, 63, 0) = 63
> ksetcontext_sigreturn(0x0FFFFFFFFFFF7680, 0x0000000000000000,
> 0x00000001100652B0, 0x800000000000D0B2, 0x0000000000000000,
> 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
> Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
The above is consistent with the upstream LDM not seeing any
data-products that match its selection-criteria. In that situation,
the upstream LDM receives a SIGCONT when a data-product is added to the
product-queue, wakes up, checks if any of the recent entries in the
product-queue match its selection-criteria, and then goes back to sleep
if none do.
Because you indicated that data-products were added to the product-queue
that do match the selection-criteria for that upstream LDM, I can only
conclude that the selection-criteria are being corrupted, somehow.
Would you be willing to modify the LDM source-code and then rebuild and
reinstall it with debugging and assertions enabled?
> Something else that may be of interest, we noticed that after the feed
> stopped there was a defunct process with the PPID listed as the PID of
> our NEXRAD2 feed (output from ps -ef for the PID 1228948):
>
> dbndev 532636 1228948 0 0:00 <defunct>
> dbndev 1228948 1028176 0 Oct 13 - 24:21 rpc.ldmd -v -q
> /usr/ldm/data/ldm.pq /usr/ldm/etc/ldmd.conf
>
> Any ideas?
This is extremely puzzling because upstream LDM processes don't call
fork(2) -- so they can't have child processes.
grep(1) the LDM logfiles to verify that the PID is that of an upstream
LDM, e.g.
fgrep '[1228948]' `ls -rt logs/ldmd.log*`
> Thanks,
>
> Jusitn
Regards,
Steve Emmerson