This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Justin, >Date: Tue, 18 Oct 2005 16:38:00 -0400 >From: Justin Cooke <address@hidden> >Organization: NOAA/NWS/FSL >To: Steve Emmerson <address@hidden> >Subject: Re: "pbuf_flush: time elapsed" problem The above message contained the following: > Our NEXRAD2 feed stopped and we had our sys admins execute truss on the > rpc.ldmd process; below is the log for ~25 mins. I assume that when you say "our NEXRAD2 feed" that you mean the UPSTREAM LDM process that's sending NEXRAD2 data-products. Please tell me if this assumption is incorrect. > ***While truss was attached to the "hung" NEXRAD2 feed we executed a > kill -USR2 on the rpc.ldmd process, you can see where it caught it but > returned to the sleeping state. > ---- > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x0000000000000000) (sleeping...) > thread_waitlock_local(0x0000000000000000) = 1 > Received signal #31, SIGUSR2 [caught] > _getpid() = 1228948 > send(2, 0x0FFFFFFFFFFF6C58, 63, 0) = 63 > ksetcontext_sigreturn(0x0FFFFFFFFFFF7680, 0x0000000000000000, > 0x00000001100652B0, 0x800000000000D0B2, 0x0000000000000000, > 0x0000000000000000, 0x0000000000000000, 0x0000000000000000) > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) > Received signal #19, SIGCONT [caught] > thread_waitlock_local(0x09001000A024F2F8) (sleeping...) The above is consistent with the upstream LDM not seeing any data-products that match its selection-criteria. In that situation, the upstream LDM receives a SIGCONT when a data-product is added to the product-queue, wakes up, checks if any of the recent entries in the product-queue match its selection-criteria, and then goes back to sleep if none do. Because you indicated that data-products were added to the product-queue that do match the selection-criteria for that upstream LDM, I can only conclude that the selection-criteria are being corrupted, somehow. Would you be willing to modify the LDM source-code and then rebuild and reinstall it with debugging and assertions enabled? > Something else that may be of interest, we noticed that after the feed > stopped there was a defunct process with the PPID listed as the PID of > our NEXRAD2 feed (output from ps -ef for the PID 1228948): > > dbndev 532636 1228948 0 0:00 <defunct> > dbndev 1228948 1028176 0 Oct 13 - 24:21 rpc.ldmd -v -q > /usr/ldm/data/ldm.pq /usr/ldm/etc/ldmd.conf > > Any ideas? This is extremely puzzling because upstream LDM processes don't call fork(2) -- so they can't have child processes. grep(1) the LDM logfiles to verify that the PID is that of an upstream LDM, e.g. fgrep '[1228948]' `ls -rt logs/ldmd.log*` > Thanks, > > Jusitn Regards, Steve Emmerson