[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: "pbuf_flush: time elapsed" problem
- Subject: Re: "pbuf_flush: time elapsed" problem
- Date: Thu, 06 Oct 2005 09:43:39 -0600
Justin,
>Date: Thu, 06 Oct 2005 09:23:50 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem
The above message contained the following:
> We completed the installation of LDM 6.4.2.4 yesterday (~15Z) and
> things were running well until 01:08Z today, that's when we again
> stopped receiving NEXRAD2 data. I did a restart of LDM once I got in
> this morning and NEXRAD2 is coming in.
>
> The TOC Level II radar cluster feeds us the NEXRAD2 data (IP:
> 140.90.85.102).
>
> I have attached the log files from that system; they only seem to have
> the standard logging enabled and an LDM restart is in their cron for
> twice a day (this has to do with their cluster availability).
>
> I didn't see anything around the 01:08Z (10/06/05) time that would
> indicate why we stopped receiving NEXRAD2 data.
Did the outage correspond to a restart of the LDM on 140.90.85.102?
If you send USR2 signals to the downstream LDM process that's receiving
the NEXRAD2 data-products from 140.90.85.102, then you can put the
process into debug mode and see where it's spending its time. Just
don't leave it in that mode! :-)
> Here is a snippet from our ldmd.log at the time we stopped receiving data:
> ---
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO: 10539
> 20051006010822.446 NEXRAD2 386060 L2-BZIP2/KMPX/20051006010419/386/60
> Oct 06 01:08:25 b2n1 pqact[815190] INFO: 10539 20051006010822.446
> NEXRAD2 386060 L2-BZIP2/KMPX/20051006010419/386/60
ASIDE: The above entries indicate that a data-product that has the time
2005-10-06 01:04:19 encoded in it's product-identifier was created at
2005-10-06 01:08:22.446 -- a difference of over 4 minutes. I hope this is
OK.
> Oct 06 01:08:25 b2n1 pqact[815190] INFO: file: -close
> /dcomdev/us007003/ldmdata/obs/upperair/nexrad_level2/KMPX/KMPX_20051006_010419.bz2
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: 3666 20051006010825.900
> FNEXRAD 555 UJXX82 KWBC 060104
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: file:
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: pipe:
> bin/decod_dcusnd -v 2 -t 600 -d
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log
> tables/stns/sonde.land.tbl tables/stns/sonde.ship.tbl
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: 84 20051006010825.900
> FNEXRAD 556 UANT01 KWBC 060105
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: file:
> /dcomdev/us007003/ldmdata/obs/upperair/aircraft/20051006.aircraft
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: pipe:
> bin/decod_dcacft -v 2 -t 300 -d
> /dcomdev/us007003/decoder_logs/decod_dcacft.log
> tables/stns/pirep.tbl tables/stns/airep.tbl
> /dcomdev/us007003/bufrtab.004
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: 194 20051006010825.900
> FNEXRAD 557 UPPK01 OPKC 060000
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: file:
> /dcomdev/us007003/ldmdata/obs/upperair/sonde/20051006.sonde
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: pipe:
> bin/decod_dcusnd -v 2 -t 600 -d
> /dcomdev/us007003/decoder_logs/decod_dcusnd.log
> tables/stns/sonde.land.tbl tables/stns/sonde.ship.tbl
> /dcomdev/us007003/bufrtab.002
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: 144 20051006010825.900
> FNEXRAD 558 SAGR33 LGAT 060100
> Oct 06 01:08:25 b2n1 140.90.85.102[1437928] INFO: 22130
> 20051006010822.509 NEXRAD2 559026 L2-BZIP2/KLWX/20051006005943/559/26
ASIDE: The last entry indicates that a data-product that has the time
2005-10-06 00:59:43 encoded in it's product-identifier was created at
2005-10-06 01:08:22.509 -- a difference of almost 9 minutes. I hope
this is OK.
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: file:
> /dcomdev/us007003/ldmdata/obs/surface/metar/20051006.metar
> Oct 06 01:08:25 b2n1 pqact[889074] INFO: pipe:
> bin/decod_dcmetr -v 2 -t 300 -d
> /dcomdev/us007003/decoder_logs/decod_dcmetr.log
> /dcomdev/us007003/bufrtab.000 tables/stns/metar.tbl
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 556 20051006010825.964
> FNEXRAD 579 SAUS70 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 136 20051006010825.964
> FNEXRAD 580 SAUS80 KWBC 060100 RRE
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 140 20051006010825.968
> FNEXRAD 581 SPUS80 KWBC 060104
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 149 20051006010825.969
> FNEXRAD 582 SASD31 OEJD 060100 RRA
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 72 20051006010825.971
> FNEXRAD 583 ULCM01 FKKD 060000
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 82 20051006010825.972
> FNEXRAD 584 SAAG00 SACO 060100
> Oct 06 01:08:25 b2n1 pqing[208898] INFO: 1177 20051006010825.974
> FNEXRAD 585 SAIY40 LIIB 060100
> ----
>
> The last entry related to NEXRAD2 is this entry into the queue: "Oct 06
> 01:08:25 b2n1 140.90.85.102[1437928] INFO: 22130 20051006010822.509
> NEXRAD2 559026 L2-BZIP2/KLWX/20051006005943/559/26", pqact never acts
> on the entry...
>
> I was poking around the email archives and saw where you suggested to a
> user that 4 instances of pqact is sometimes a good idea when working
> with the CRAFT data,
>
> pqact -f CRAFT -p BZIP2/K[A-D] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[E-K] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[L-R] etc/pqact.craft
> pqact -f CRAFT -p BZIP2/K[S-Z] etc/pqact.craft
>
> Do you think adding these entries to our ldmd.conf would make any
> difference?
Processing data-products using multiple, disjoint pqact(1) processes can
improve throughput over a single pqact(1) process if the single process
has reached the limit on the number of open file-descriptors and has to
constantly close file-descriptors in order to open a new ones. You can
determine the limit on the number of open file descriptors via the
commands "ulimit -a" or "getconf OPEN_MAX".
Also, be sure that the pqact(1) configuration-file contains only
relevant entries because the pqact(1) program tests every data-product
against every entry.
Using a faster computer can also help -- as will having sufficient
memory so that the relevant processes remain in memory rather than being
swapped to disk (the top(1) utility can help determine this).
> Thanks again for the help,
>
> Justin
[snip]
Regards,
Steve Emmerson