
Re: decoder not working on one LDM machine but it does on another LDM, and Re: pbuf_flush



On Thu, 6 Apr 2000, Gregory Grosshans wrote:

> Robb,
> 
> Yes, the ldmd.log file has been showing multiple dcmetr processes
> starting/running.
> 
> Here is some additional information about the K220, the K370, and the K220 that is
> feeding the NOAAPORT data to the first two machines.
> 
> The K220 ingesting the NOAAPORT data and feeding the other K220 (the one with the
> dcmetr problem) and the K370 also runs dcmetr with no problems.  This machine is
> ingesting the NWSTG and GOES-E channels into the LDM with a modified version of the
> FSL acqserver (which makes the NOAAPORT data look like a FOS feed).
> 
> With a short little awk script I totalled the number of products ingested and the
> total bytes for the three K machines and each feedtype (a sketch of this kind of
> tally is included after the numbers below):
> 
> The K220 with the NWSTG and GOES-E channels ingesting into the LDM (as FSL5) has the
> following for a 24-hour period:
> 
>     FSL5                          ALL FEEDS (FSL5,DDS,PPS,AFOS)
>     325216 products               418782 products
>     5.685 GB of data              5.762 GB of data
> 
> The K370 also has PCWS:
> 
>     FSL5                          ALL FEEDS (FSL5,DDS,PPS,AFOS,PCWS)
>     252001 products               345711 products
>     4.676 GB of data              4.819 GB of data
> 
> The K220 (with the DCMETR problem):
> 
>     FSL5                          ALL FEEDS (FSL5,DDS,PPS,AFOS)
>     44536 products                136896 products
>     14 MB of data                 93 MB of data
> 
> The DDS feed has approx. 45000 products/day and 31 MB/day
> The PPS feed has approx. 8100 products/day and 20 MB/day
> The AFOS feed has approx. 39000 products/day and 26 MB/day
> The PCWS feed has approx. 144 products/day and 65 MB/day.
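> 
> (Not the actual script, just a minimal sketch of that kind of per-feedtype tally.
> It assumes a product listing with the size in bytes in column 1 and the feedtype
> in column 3, which will differ from whatever listing the real script read:)
> 
>   awk '{ n[$3]++; b[$3] += $1 }
>        END { for (f in n) printf "%-8s %8d products %14d bytes\n", f, n[f], b[f] }' products.list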
> 
> Two items are noteworthy.  The K370 is using the request line
> 
>     request FSL5    (^[A-FK-X])     data.spc.noaa.gov
> 
> With this request line it receives 252001 products/day, compared to DDS/PPS at about
> 53000 products/day.  I was under the impression that this request line resulted in
> something similar to the DDS/PPS feed.  However, it must be ingesting many other
> products I don't know about (e.g. grib, etc.).
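> 
> (One way to see exactly what that pattern is pulling in, assuming notifyme from the
> LDM distribution is in the path, is to watch the feed for an hour or so with
> something like:)
> 
>   notifyme -vl- -h data.spc.noaa.gov -f FSL5 -o 3600 -p '^[A-FK-X]'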
> 
> Second, I was surprised at how small the total LDM data ingest was.  I mentioned to
> the sys admin that you think the K220 is too slow.  This machine is the server that
> 8 operational workstations point to for NAWIPS data.  He thinks it may be that the
> disks can't keep up.  Right now the particular filesystem has mincache set to
> tmpcache (performance instead of integrity); a while ago he changed this setting to
> dsync to make sure the OS writes the data to disk.  When that filesystem change was
> made, the load on the system jumped way up (into the 20s) and it was changed back to
> tmpcache.  So the disks on this system are constantly busy.
> 
Greg,

Maybe my statement that the K220 computer is too slow was too general.  It is the
disk I/O that is too slow to support the LDM activity plus the 8 operational
workstations.  That's what the pbuf_flush entries in the ldmd.log file are
referring to.  The transition point between a computer environment that works
correctly and one that doesn't can be narrow sometimes.  I don't know your exact
configuration, but I think if you rearranged it to lessen the disk I/O, the dcmetr
decoder would quit exiting.  As long as the dcmetr decoder keeps generating a high
number of pbuf_flush entries, you will continue to have the problem.
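
One way to confirm it (just a suggestion, and I'm assuming sar is available on your
HP-UX boxes) is to watch the disks while both the LDM and the NAWIPS workstations
are busy, e.g.

  sar -d 5 12        # per-disk %busy, queue length, and average wait, 12 samples

If the data disks sit near 100% busy at the same times the pbuf_flush entries show
up, that points at the disk I/O rather than the CPUs.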

Robb... 


> With the above information do you still suspect the machine is too slow to handle
> the additional data, even though the total LDM ingest of products is considerably
> smaller?
> 
> Thanks,
> Gregg
> 
> 
> Robb Kambic wrote:
> 
> > Greg,
> >
> > After much discussion, I believe the problem is that the K220 computer is
> > too slow to support the dcmetr decoder with the extra FSL5 feed.  That's
> > why you are seeing all the pbuf_flush entries in the log files.  What we
> > believe is happening is that the dcmetr decoder's PIPE is getting filled, so
> > the LDM is spawning another dcmetr decoder.  Since both decoders are writing
> > to the same file, one gets a signal 11 and produces a corrupt output file.
> > You should check the GEMPAK log file for multiple starts of the dcmetr
> > decoder without shutdown messages.  Also, as you said, when dcmetr gets just
> > the NOAAPORT stream it works OK, i.e. it can keep up.
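> >
> > (A quick check, since the decoder logs its starts and stops to the LDM log, is
> > something like
> >
> >   grep DCMETR ldmd.log | egrep 'Starting Up|Normal Termination'
> >
> > A "Starting Up" with no "Normal Termination" before the next start would point
> > at overlapping dcmetr processes.)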
> >
> > Robb...
> >
> > On Thu, 6 Apr 2000, Gregory Grosshans wrote:
> >
> > > I recently added a NOAAPORT feed similar to the DDS/PPS feed (i.e. request FSL5
> > > (^[A-FK-X]) ) to a machine running LDM (5.0.9) that already receives the
> > > following feedtypes: AFOS, DDS, PPS.  Once the NOAAPORT feed started coming in,
> > > the metar decoder (i.e. dcmetr) began terminating frequently with signal 11,
> > > resulting in a corrupted GEMPAK surface observation file.  This particular
> > > machine is a K220 (2 100 MHz CPUs with 1 GB RAM).
> > >
> > > On a K370 (3 200 MHz CPUs and faster, with 1.2 GB RAM) the same LDM (5.0.8)
> > > setup exists, the resulting surface file is fine, and dcmetr doesn't terminate.
> > >
> > > Do you have any ideas on why a problem exists on one machine but not on the
> > > other?  Have you seen anything like this before?  I've provided additional
> > > information about the two configurations and the troubleshooting steps I've
> > > already taken below.  The NOAAPORT ingest runs on a second K210 server with LDM
> > > (5.0.8) running and the FSL acqserver code (modified by SPC/AWC to make the
> > > data look like it's from the FOS).  This second K210 then feeds NOAAPORT data
> > > (via the FSL5 feedtype) to the K210 and K370.
> > >
> > > Thanks,
> > > Gregg
> > >
> > > Some specifics follow:
> > >
> > > On the K370 (where everything works fine) the configuration is as follows:
> > >
> > > ---cut from ldmd.conf---
> > > exec    "pqexpire -a .5"
> > > request AFOS    .*      hostname-removed
> > > request DDS     .*      hostname-removed
> > > request PPS     .*      hostname-removed
> > > request FSL5    (^[A-FK-X])     hostname-removed
> > >                 (this is the NOAAPORT stream received in-house)
> > > request PCWS    "^FSL\.NetCDF\.ACARS\.QC\..*"   hostname-removed
> > > ---
> > > $pq_size = 750000000;
> > >
> > > The K370 uses the dcuair and dcmetr decoders, and the FEEDTYPE for each decoder
> > > is FSL5|DDS.  This lets the data arrive as soon as possible and also provides
> > > redundancy (in case the FSL5 feed goes down).
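> > >
> > > (For reference only, not the actual entry: a pqact.conf action for such a decoder
> > > would look roughly like the sketch below, with the fields tab-separated.  The
> > > pattern, the dcmetr options, and the output file template here are placeholders,
> > > not the real ones.)
> > >
> > >   FSL5|DDS    ^S[AP]    PIPE    decoders/dcmetr
> > >       -d data/gempak/logs/dcmetr.log
> > >       data/gempak/surface/YYYYMMDD_sao.gem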
> > >
> > > On the K210 (where everything except the metar GEMPAK file works fine) the
> > > configuration is as follows:
> > >
> > > ---cut from the ldmd.conf---
> > > exec    "pqexpire -a .23333"
> > > request AFOS    .*      hostname-removed
> > > request DDS     .*      hostname-removed
> > > request PPS     .*      hostname-removed
> > > #request FSL5   (^[A-FK-X])     hostname-removed
> > > # UA data and AVN MOS
> > > #request FSL5   (^U[EGHKLPQSZ]|^FE)             hostname-removed
> > > # SFC/UA data and AVN MOS
> > > request FSL5    (^U[EGHKLPQSZ]|^FE|^S[AP])              hostname-removed
> > > ---
> > > #$pq_size = 750000000;
> > > $pq_size = 200000000;
> > >
> > > The K210 uses the dcuair and dcmetr decoders.  The dcuair output is fine.  The
> > > dcmetr output becomes corrupted.
> > >
> > > Before adding the entire NOAAPORT feed to the K210 it was receiving the UA and
> > > AVN MOS data from NOAAPORT just fine.  The LDM at that time was 5.0.8.  After
> > > adding the entire NOAAPORT stream (request ^[A-FK-X]) the dcmetr output became
> > > corrupted.  I then copied the dcmetr decoder from the K370 to the K210, but the
> > > GEMPAK surface file still became corrupted.  Next, I upgraded to LDM-5.0.9 on
> > > the K210, but still no success.  Then I reduced the request statement to just
> > > the upperair, metar, and AVN MOS data.  However, the surface file was still
> > > becoming corrupted.  The most recent attempt was to lower the queue size from
> > > 750 MB to 200 MB (since I'm not ingesting all the NOAAPORT data), but the
> > > surface file still becomes corrupted.
> > >
> > > pbuf_flush messages still appear sporadically, usually soon after the LDM is
> > > started:
> > >
> > > ldmspock@spock> ll ldmd.*
> > > -rw-r--r--   1 ldmspock   data       8522013 Apr  6 13:56 ldmd.log
> > > -rw-r--r--   1 ldmspock   data       5645263 Apr  6 00:00 ldmd.log.1
> > > -rw-r--r--   1 ldmspock   data       1262077 Apr  5 15:19 ldmd.log.2
> > > -rw-r--r--   1 ldmspock   data       8324510 Apr  5 13:19 ldmd.log.3
> > > -rw-r--r--   1 ldmspock   data       2973459 Apr  5 00:00 ldmd.log.4
> > > -rw-r--r--   1 ldmspock   data       5841177 Apr  4 19:03 ldmd.log.5
> > > -rw-r--r--   1 ldmspock   data          3835 Apr  4 16:25 ldmd.log.6
> > > -rw-r--r--   1 ldmspock   data       26420210 Apr  4 16:23 ldmd.log.7
> > > ldmspock@spock> grep pbuf ldmd.log ldmd.log.1
> > > ldmd.log:Apr 06 06:10:12 spock pqact[17920]: pbuf_flush 35: time elapsed 3.339170
> > > ldmd.log:Apr 06 06:10:23 spock pqact[17920]: pbuf_flush 35: time elapsed 6.926433
> > > ldmd.log:Apr 06 06:10:25 spock pqact[17920]: pbuf_flush 35: time elapsed 2.288729
> > > ldmd.log.1:Apr 05 15:25:05 spock pqact[17920]: pbuf_flush 35: time elapsed 4.764068
> > > ldmd.log.1:Apr 05 15:25:14 spock pqact[17920]: pbuf_flush 35: time elapsed 5.619390
> > > ldmd.log.1:Apr 05 15:25:20 spock pqact[17920]: pbuf_flush 35: time elapsed 3.642364
> > > ldmd.log.1:Apr 05 15:25:35 spock pqact[17920]: pbuf_flush 35: time elapsed 11.463704
> > > ldmd.log.1:Apr 05 15:25:46 spock pqact[17920]: pbuf_flush 35: time elapsed 7.433896
> > > ldmd.log.1:Apr 05 15:25:53 spock pqact[17920]: pbuf_flush 35: time elapsed 5.602335
> > > ldmd.log.1:Apr 05 15:26:12 spock pqact[17920]: pbuf_flush 35: time elapsed 6.995401
> > > ldmd.log.1:Apr 05 15:26:17 spock pqact[17920]: pbuf_flush 35: time elapsed 2.613388
> > > ldmd.log.1:Apr 05 15:26:22 spock pqact[17920]: pbuf_flush 35: time elapsed 3.780494
> > > ldmd.log.1:Apr 05 15:26:34 spock pqact[17920]: pbuf_flush 35: time elapsed 2.924638
> > > ldmd.log.1:Apr 05 15:26:44 spock pqact[17920]: pbuf_flush 35: time elapsed 2.035412
> > > ldmd.log.1:Apr 05 15:30:16 spock pqact[17920]: pbuf_flush 35: time elapsed 3.280656
> > > ldmd.log.1:Apr 05 21:10:08 spock pqact[17920]: pbuf_flush 35: time elapsed 5.369202
> > > ldmd.log.1:Apr 05 23:10:09 spock pqact[17920]: pbuf_flush 35: time elapsed 2.265447
> > >
> > > The DCMETR processes that were killed follow.  The processes killed during 15Z
> > > in ldmd.log.1 were another decoder receiving test products.  As you can see,
> > > there haven't been many child processes killed today, but the file is corrupted.
> > > Performing an sflist on the surface obs file resulted in a segmentation fault
> > > while the 1Z output was displaying.
> > >
> > > grep child ldmd.log ldmd.log.1
> > > ldmd.log:Apr 06 04:07:27 spock pqact[17920]: child 5137 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:49 spock pqact[17920]: child 20923 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:50 spock pqact[17920]: child 20938 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:57 spock pqact[17920]: child 21095 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:31:52 spock pqact[17920]: child 24263 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:31:54 spock pqact[17920]: child 24292 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:33:17 spock pqact[17920]: child 25397 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:07:30 spock pqact[17920]: child 5674 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:13:49 spock pqact[17920]: child 15384 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:31:30 spock pqact[17920]: child 24463 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:37:15 spock pqact[17920]: child 27998 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:57:56 spock pqact[17920]: child 9436 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:02:03 spock pqact[17920]: child 10868 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:17:29 spock pqact[17920]: child 20707 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:20:54 spock pqact[17920]: child 20898 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:21:53 spock pqact[17920]: child 22583 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:23:35 spock pqact[17920]: child 22924 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:27:07 spock pqact[17920]: child 23765 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:30:04 spock pqact[17920]: child 25231 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:41:46 spock pqact[17920]: child 741 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:46:38 spock pqact[17920]: child 1216 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:50:08 spock pqact[17920]: child 3376 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:01:35 spock pqact[17920]: child 8924 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:01:55 spock pqact[17920]: child 9569 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:02:08 spock pqact[17920]: child 9736 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:04:20 spock pqact[17920]: child 9786 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:05:06 spock pqact[17920]: child 10241 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:08:47 spock pqact[17920]: child 15683 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:14:30 spock pqact[17920]: child 16273 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:15:40 spock pqact[17920]: child 17962 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:15:44 spock pqact[17920]: child 18700 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:16:07 spock pqact[17920]: child 18714 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:16:08 spock pqact[17920]: child 18899 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:26:21 spock pqact[17920]: child 21652 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:26:41 spock pqact[17920]: child 23263 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:46:58 spock pqact[17920]: child 28329 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:57:50 spock pqact[17920]: child 6435 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:58:56 spock pqact[17920]: child 6554 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:02:36 spock pqact[17920]: child 7837 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:03:07 spock pqact[17920]: child 8165 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:03:50 spock pqact[17920]: child 8377 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:08:03 spock pqact[17920]: child 13935 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:09:06 spock pqact[17920]: child 14026 terminated by signal 11
> > >
> > >
> > >
> > >
> > >
> > > Robb Kambic wrote:
> > >
> > > > On Thu, 30 Mar 2000, Gregory Grosshans wrote:
> > > >
> > > > > Can someone explain what a 'pbuf_flush 15' entry in the ldmd.log file
> > > > > means?
> > > > >
> > > > > This LDM ingests the Gateway channel from NOAAPORT (as FSL5) plus PPS and
> > > > > DDS (note the request pattern below).  The LDM queue is 750 MB.
> > > >
> > > > Greg,
> > > >
> > > > The pbuf_flush 15 means that the ACTION (i.e. FILE, STDIO, PIPE, etc.)
> > > > could not write to its destination because the resource was not available
> > > > at that time or was blocked.  The log entry usually occurs when disk
> > > > activity is at its peak for some reason.  It could be the LDM doing many
> > > > disk writes or another process tying up the disk.
> > > >
> > > > time elapsed 5.690810
> > > >
> > > > means that it took 5.69 seconds to complete the write.
> > > >
> > > > If this occurs a bunch of times, one could start to lose data.  Then it's
> > > > time to determine the reason for the pbuf_flush entries.
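> > > >
> > > > (A rough way to see how often it is happening, assuming the usual ldmd.log
> > > > timestamp at the start of each line, is a per-hour count such as
> > > >
> > > >   grep pbuf_flush ldmd.log | cut -c1-9 | uniq -c
> > > >
> > > > More than the occasional entry per hour means the PIPEd decoders are
> > > > regularly waiting on the disk.)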
> > > >
> > > > Robb...
> > > > >
> > > > > Thanks,
> > > > > Gregg Grosshans
> > > > > SPC
> > > > >
> > > > > ---information from ldmd.conf---------
> > > > > pqexpire -a .23333
> > > > > request FSL5    (^[A-FK-X])     (hostname was here)
> > > > >
> > > > > ---information from ldmadmin---------
> > > > > $pq_size = 750000000;
> > > > >
> > > > >
> > > > > ------portion of ldmd.log-------------
> > > > >
> > > > > Mar 30 15:07:24 spock pqact[14333]: pbuf_flush 15: time elapsed 5.690810
> > > > > Mar 30 15:07:24 spock DCMETR[10991]: Normal Termination
> > > > > Mar 30 15:07:24 spock DCMETR[11773]: Starting Up
> > > > > Mar 30 15:07:31 spock pqact[14333]: pbuf_flush 15: time elapsed 6.166595
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > ===============================================================================
> > > > Robb Kambic                                Unidata Program Center
> > > > Software Engineer III                      Univ. Corp for Atmospheric Research
> > > > address@hidden                   WWW: http://www.unidata.ucar.edu/
> > > > ===============================================================================
> > >
> >
> > ===============================================================================
> > Robb Kambic                                Unidata Program Center
> > Software Engineer III                      Univ. Corp for Atmospheric Research
> > address@hidden                   WWW: http://www.unidata.ucar.edu/
> > ===============================================================================
> 

===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================