Re: decoder not working on one LDM machine but it does on another LDM and Re: pbuf_flush
- Subject: Re: decoder not working on one LDM machine but it does on another LDM and Re: pbuf_flush
- Date: Fri, 7 Apr 2000 11:42:29 -0600 (MDT)
On Thu, 6 Apr 2000, Gregory Grosshans wrote:
> Robb,
>
> Yes the ldmd.log file has been showing multiple dcmetr processes
> starting/running.
>
> Here is some additional information about the K220, K370 and the K220 feeding the
> NOAAPORT data to the first two machines.
>
> The K220 that ingests the NOAAPORT data and feeds the other K220 (the one with the
> dcmetr problem) and the K370 also runs dcmetr with no problems. This machine ingests
> the NWSTG and GOES-E channels into LDM with a modified version of the FSL acqserver
> (which makes the NOAAPORT data look like a FOS feed).
>
> With a short awk script I totalled the number of products ingested and the total
> bytes for the three K machines and each feedtype (a sketch of such a tally appears
> after these totals):
>
> The K220 with the NWSTG and GOES-E channels ingesting into LDM (as FSL5) has the
> following for a 24 hour period:
> FSL5                   ALL FEEDS (FSL5,DDS,PPS,AFOS)
> 325216 products        418782 products
> 5.685 GB of data       5.762 GB of data
>
> The K370 also has PCWS:
> FSL5                   ALL FEEDS (FSL5,DDS,PPS,AFOS,PCWS)
> 252001 products        345711 products
> 4.676 GB of data       4.819 GB of data
>
> The K220 (with DCMETR problem):
> FSL5                   ALL FEEDS (FSL5,DDS,PPS,AFOS)
> 44536 products         136896 products
> 14 MB of data          93 MB of data
>
> The DDS feed has approx. 45000 products/day and 31 MB/day
> The PPS feed has approx. 8100 products/day and 20 MB/day
> The AFOS feed has approx. 39000 products/day and 26 MB/day
> The PCWS feed has approx. 144 products/day and 65 MB/day.
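A tally like the one described above can be done with a short awk one-liner. This is
only a sketch: it assumes a verbose product log where each line carries the product
size in bytes and the feedtype in fixed fields (the field positions below are
assumptions and would need to match the actual log format):

    ---hypothetical awk tally (field positions are assumptions)---
    # sum the product count and bytes per feedtype from a verbose product log;
    # field 6 is assumed to hold the size in bytes and field 8 the feedtype
    awk '{ count[$8]++; bytes[$8] += $6 }
         END { for (f in count)
                   printf "%-8s %9d products %10.3f MB\n",
                          f, count[f], bytes[f]/1000000 }' products.log
    ---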
>
> Two items are noteworthy. The K370 is using a request line of
>     request FSL5 (^[A-FK-X]) data.spc.noaa.gov
> With this request line it receives 252001 products/day, compared to DDS/PPS at
> 53000 products/day. I was under the impression that this request line resulted in
> something similar to the DDS/PPS feed. However, it must be ingesting many other
> products I don't know about (e.g. grib, etc.).
>
> Second, I was surprised at how small the total LDM data ingest was. I mentioned to
> the sys admin that you think the K220 is too slow. This machine is the server that 8
> operational workstations point to for NAWIPS data. He thinks it may be that the disks
> can't keep up. Right now the particular filesystem has mincache set to tmpcache
> (performance instead of integrity); a while ago he changed this setting to dsync to
> make sure the OS writes the data to disk. When that filesystem change was made, the
> load on the system jumped way up (into the 20s), and it was changed back to tmpcache.
> So the disks on this system are constantly busy.
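For reference, the mincache behavior mentioned above is a VxFS mount option on HP-UX.
A hypothetical /etc/fstab-style entry (the device and mount point here are invented)
might look like:

    ---illustrative VxFS mount options (device and mount point invented)---
    # performance-oriented caching, as the filesystem is currently mounted
    /dev/vg01/lvol3  /data  vxfs  mincache=tmpcache  0 2
    # integrity-oriented alternative that drove the load into the 20s
    # /dev/vg01/lvol3  /data  vxfs  mincache=dsync  0 2
    ---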
>
Greg,
Maybe my statement that the K220 computer is too slow was too general. It
is the disk I/O that is too slow to support the LDM activity plus the
support of the 8 operational workstations. That's what the pbuf_flush
entries in the ldmd.log file are referring to. The transition point between
a computer environment that works correctly and one that doesn't can
sometimes be narrow. I don't know your exact configuration, but I think
if you rearranged it to lessen the disk I/O, then the dcmetr decoder would
quit exiting. As long as dcmetr keeps generating a high number of
pbuf_flush entries, you will continue to have the problem.
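One way to confirm that the disks, rather than the CPUs, are the bottleneck is to
watch the disk activity while the LDM and the workstations are busy. A minimal
sketch using sar (the interval and count are arbitrary):

    ---checking disk activity (sketch)---
    # sample disk activity every 5 seconds, 12 times; consistently high %busy
    # and long average waits point at disk I/O rather than CPU
    sar -d 5 12
    ---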
Robb...
> With the above information do you still suspect the machine is too slow to handle
> the additional data, even though the total LDM ingest of products is considerably
> smaller?
>
> Thanks,
> Gregg
>
>
> Robb Kambic wrote:
>
> > Greg,
> >
> > After much discussion, I believe the problem is that the K220 computer is
> > too slow to support the dcmetr decoder with the extra FSL5 feed. That's
> > why you are seeing all the pbuf_flush entries in the log files. What we
> > believe is happening is that the dcmetr decoder PIPE is getting filled, so
> > the LDM is spawning another dcmetr decoder. Since both decoders are writing
> > to the same file, one gets a signal 11 and produces a corrupt output
> > file. You should check the GEMPAK log file for multiple starts of the
> > dcmetr decoder without matching shutting-down messages. Also, as you said,
> > if you use just the NOAAPORT stream, dcmetr works OK, i.e. it can keep up.
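A quick way to spot the overlapping decoders is to look for Starting Up messages
that are not paired with a Normal Termination. A sketch against ldmd.log (which
carries the DCMETR start/stop messages, as in the excerpt further down), or against
whatever log dcmetr itself writes:

    ---looking for overlapping dcmetr instances (sketch)---
    # a healthy cycle shows a Starting Up followed later by a Normal Termination;
    # two Starting Up lines in a row mean a second dcmetr was spawned while the
    # first was still running
    grep -E "DCMETR.*(Starting Up|Normal Termination)" ldmd.log | tail -40
    ---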
> >
> > Robb...
> >
> > On Thu, 6 Apr 2000, Gregory Grosshans wrote:
> >
> > > I recently added a NOAAPORT feed similar to the DDS/PPS feed (i.e.
> > > request FSL5 (^[A-FK-X]) ) to a machine running LDM (5.0.9) that was already
> > > receiving the following feedtypes: AFOS, DDS, PPS. Once the NOAAPORT feed
> > > started coming in, the metar decoder (i.e. dcmetr) began terminating
> > > frequently from signal 11, resulting in a corrupted GEMPAK surface
> > > observation file. This particular machine is a K220 (2 100 MHz CPUs with
> > > 1 GB RAM).
> > >
> > > On a K370 (3 200 MHz CPUs, and faster, with 1.2 GB RAM) the same LDM
> > > (5.0.8) setup exists; the resulting surface file is fine and dcmetr doesn't
> > > terminate.
> > >
> > > Do you have any ideas on why a problem exists on one machine but not on the
> > > other? Have you seen anything like this before? I've provided additional
> > > information about the two configurations and the troubleshooting steps I've
> > > already taken below. The NOAAPORT ingest runs on a second K210 server with
> > > LDM (5.0.8) running and the FSL acqserver code (modified by SPC/AWC to make
> > > the data look like it's from the FOS). This second K210 then feeds NOAAPORT
> > > data (via the FSL5 feedtype) to the K210 and K370.
> > >
> > > Thanks,
> > > Gregg
> > >
> > > Some specifics follow:
> > >
> > > On the K370 (everything works fine) the configuration is as follows:
> > >
> > > ---cut from ldmd.conf---
> > > exec "pqexpire -a .5"
> > > request AFOS .* hostname-removed
> > > request DDS .* hostname-removed
> > > request PPS .* hostname-removed
> > > # This is the NOAAPORT stream received in-house
> > > request FSL5 (^[A-FK-X]) hostname-removed
> > > request PCWS "^FSL\.NetCDF\.ACARS\.QC\..*" hostname-removed
> > > ---
> > > $pq_size = 750000000;
> > >
> > > The K370 uses the dcuair and dcmetr decoders, and the FEEDTYPE for each
> > > decoder is FSL5|DDS. This allows the data to be received as soon as possible
> > > and also provides redundancy (in case the FSL5 feed goes down).
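For context, the dcmetr hookup in pqact.conf typically looks something like the
entry below. This is an illustration only, not the actual entry from either machine:
the pattern, decoder options, and output path are assumptions, and in a real
pqact.conf the fields must be separated by tabs.

    ---illustrative pqact.conf entry (pattern, options and paths are assumptions)---
    FSL5|DDS    ^S[AP]
            PIPE    decoders/dcmetr
                    -d data/gempak/logs/dcmetr.log
                    data/gempak/surface/YYYYMMDD_sao.gem
    ---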
> > >
> > > On the K210 (everything except the metar GEMPAK file works fine) the
> > > configuration is as follows:
> > >
> > > ---cut from the ldmd.conf---
> > > exec "pqexpire -a .23333"
> > > request AFOS .* hostname-removed
> > > request DDS .* hostname-removed
> > > request PPS .* hostname-removed
> > > #request FSL5 (^[A-FK-X]) hostname-removed
> > > # UA data and AVN MOS
> > > #request FSL5 (^U[EGHKLPQSZ]|^FE) hostname-removed
> > > # SFC/UA data and AVN MOS
> > > request FSL5 (^U[EGHKLPQSZ]|^FE|^S[AP]) hostname-removed
> > > ---
> > > #$pq_size = 750000000;
> > > $pq_size = 200000000;
> > >
> > > The K210 uses the dcuair and dcmetr decoders. The dcuair output is fine.
> > > The dcmetr output becomes corrupted.
> > >
> > > Before the entire NOAAPORT feed was added, the K210 was receiving the UA and
> > > AVN MOS data from NOAAPORT just fine. The LDM at that time was 5.0.8. After
> > > adding the entire NOAAPORT stream (request ^[A-FK-X]), the dcmetr output
> > > became corrupted. I then copied the dcmetr decoder from the K370 to the K210,
> > > but the GEMPAK surface file still became corrupted. Next, I upgraded to
> > > LDM-5.0.9 on the K210, but still no success. Then I reduced the request
> > > statement to just the upper-air, metar and AVN MOS data; however, the surface
> > > file was still becoming corrupted. The most recent attempt was to lower the
> > > queue size from 750 MB to 200 MB (since I'm not ingesting all the NOAAPORT
> > > data), but the surface file still becomes corrupted.
> > >
> > > PBUF messages are still received sporadically, usually soon after starting
> > > LDM up:
> > >
> > > ldmspock@spock> ll ldmd.*
> > > -rw-r--r-- 1 ldmspock data 8522013 Apr 6 13:56 ldmd.log
> > > -rw-r--r-- 1 ldmspock data 5645263 Apr 6 00:00 ldmd.log.1
> > > -rw-r--r-- 1 ldmspock data 1262077 Apr 5 15:19 ldmd.log.2
> > > -rw-r--r-- 1 ldmspock data 8324510 Apr 5 13:19 ldmd.log.3
> > > -rw-r--r-- 1 ldmspock data 2973459 Apr 5 00:00 ldmd.log.4
> > > -rw-r--r-- 1 ldmspock data 5841177 Apr 4 19:03 ldmd.log.5
> > > -rw-r--r-- 1 ldmspock data 3835 Apr 4 16:25 ldmd.log.6
> > > -rw-r--r-- 1 ldmspock data 26420210 Apr 4 16:23 ldmd.log.7
> > > ldmspock@spock> grep pbuf ldmd.log ldmd.log.1
> > > ldmd.log:Apr 06 06:10:12 spock pqact[17920]: pbuf_flush 35: time elapsed 3.339170
> > > ldmd.log:Apr 06 06:10:23 spock pqact[17920]: pbuf_flush 35: time elapsed 6.926433
> > > ldmd.log:Apr 06 06:10:25 spock pqact[17920]: pbuf_flush 35: time elapsed 2.288729
> > > ldmd.log.1:Apr 05 15:25:05 spock pqact[17920]: pbuf_flush 35: time elapsed 4.764068
> > > ldmd.log.1:Apr 05 15:25:14 spock pqact[17920]: pbuf_flush 35: time elapsed 5.619390
> > > ldmd.log.1:Apr 05 15:25:20 spock pqact[17920]: pbuf_flush 35: time elapsed 3.642364
> > > ldmd.log.1:Apr 05 15:25:35 spock pqact[17920]: pbuf_flush 35: time elapsed 11.463704
> > > ldmd.log.1:Apr 05 15:25:46 spock pqact[17920]: pbuf_flush 35: time elapsed 7.433896
> > > ldmd.log.1:Apr 05 15:25:53 spock pqact[17920]: pbuf_flush 35: time elapsed 5.602335
> > > ldmd.log.1:Apr 05 15:26:12 spock pqact[17920]: pbuf_flush 35: time elapsed 6.995401
> > > ldmd.log.1:Apr 05 15:26:17 spock pqact[17920]: pbuf_flush 35: time elapsed 2.613388
> > > ldmd.log.1:Apr 05 15:26:22 spock pqact[17920]: pbuf_flush 35: time elapsed 3.780494
> > > ldmd.log.1:Apr 05 15:26:34 spock pqact[17920]: pbuf_flush 35: time elapsed 2.924638
> > > ldmd.log.1:Apr 05 15:26:44 spock pqact[17920]: pbuf_flush 35: time elapsed 2.035412
> > > ldmd.log.1:Apr 05 15:30:16 spock pqact[17920]: pbuf_flush 35: time elapsed 3.280656
> > > ldmd.log.1:Apr 05 21:10:08 spock pqact[17920]: pbuf_flush 35: time elapsed 5.369202
> > > ldmd.log.1:Apr 05 23:10:09 spock pqact[17920]: pbuf_flush 35: time elapsed 2.265447
> > >
> > > The DCMETR processes killed follow. The processes killed during 15Z in
> > > ldmd.log.1 were another decoder receiving test products. As you can see,
> > > there haven't been many child processes killed today, but the file is
> > > corrupted. Performing an sflist on the surface obs file resulted in a
> > > segmentation fault while the 1Z output was displaying.
> > > grep child ldmd.log ldmd.log.1
> > > ldmd.log:Apr 06 04:07:27 spock pqact[17920]: child 5137 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:49 spock pqact[17920]: child 20923 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:50 spock pqact[17920]: child 20938 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:28:57 spock pqact[17920]: child 21095 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:31:52 spock pqact[17920]: child 24263 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:31:54 spock pqact[17920]: child 24292 terminated by signal 11
> > > ldmd.log.1:Apr 05 15:33:17 spock pqact[17920]: child 25397 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:07:30 spock pqact[17920]: child 5674 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:13:49 spock pqact[17920]: child 15384 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:31:30 spock pqact[17920]: child 24463 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:37:15 spock pqact[17920]: child 27998 terminated by signal 11
> > > ldmd.log.1:Apr 05 16:57:56 spock pqact[17920]: child 9436 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:02:03 spock pqact[17920]: child 10868 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:17:29 spock pqact[17920]: child 20707 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:20:54 spock pqact[17920]: child 20898 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:21:53 spock pqact[17920]: child 22583 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:23:35 spock pqact[17920]: child 22924 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:27:07 spock pqact[17920]: child 23765 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:30:04 spock pqact[17920]: child 25231 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:41:46 spock pqact[17920]: child 741 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:46:38 spock pqact[17920]: child 1216 terminated by signal 11
> > > ldmd.log.1:Apr 05 17:50:08 spock pqact[17920]: child 3376 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:01:35 spock pqact[17920]: child 8924 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:01:55 spock pqact[17920]: child 9569 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:02:08 spock pqact[17920]: child 9736 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:04:20 spock pqact[17920]: child 9786 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:05:06 spock pqact[17920]: child 10241 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:08:47 spock pqact[17920]: child 15683 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:14:30 spock pqact[17920]: child 16273 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:15:40 spock pqact[17920]: child 17962 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:15:44 spock pqact[17920]: child 18700 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:16:07 spock pqact[17920]: child 18714 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:16:08 spock pqact[17920]: child 18899 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:26:21 spock pqact[17920]: child 21652 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:26:41 spock pqact[17920]: child 23263 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:46:58 spock pqact[17920]: child 28329 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:57:50 spock pqact[17920]: child 6435 terminated by signal 11
> > > ldmd.log.1:Apr 05 18:58:56 spock pqact[17920]: child 6554 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:02:36 spock pqact[17920]: child 7837 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:03:07 spock pqact[17920]: child 8165 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:03:50 spock pqact[17920]: child 8377 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:08:03 spock pqact[17920]: child 13935 terminated by signal 11
> > > ldmd.log.1:Apr 05 19:09:06 spock pqact[17920]: child 14026 terminated by signal 11
> > >
> > > Robb Kambic wrote:
> > >
> > > > On Thu, 30 Mar 2000, Gregory Grosshans wrote:
> > > >
> > > > > Can someone explain what a 'pbuf_flush 15' entry in the ldmd.log file
> > > > > means?
> > > > >
> > > > > This LDM ingests the Gateway channel from NOAAPORT (as FSL5) plus PPS and
> > > > > DDS (note the request pattern below). The LDM queue is 750 MB.
> > > >
> > > > Greg,
> > > >
> > > > The pbuf_flush 15 means that the ACTION, i.e. ( FILE, STDIO, PIPE, etc. ),
> > > > could not write to the destination because the resource was not available
> > > > at that time or it was blocked. The log entry usually occurs when disk
> > > > activity is at its peak for some reason. It could be the LDM doing many
> > > > disk writes or another process tying up the disk.
> > > >
> > > > time elapsed 5.690810
> > > >
> > > > means that it took 5.6 seconds to complete the activity.
> > > >
> > > > If this occurs a bunch of times, one could start to lose data. Then it's
> > > > time to determine the reason for the pbuf_flush entries.
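A quick way to gauge how often and how badly the flushes are stalling is to
summarize the elapsed times. A sketch, assuming the elapsed seconds are the last
field on each pbuf_flush line (as in the excerpts in this thread):

    ---summarizing pbuf_flush delays (sketch)---
    # count the flush warnings and report the average and worst elapsed time
    grep pbuf_flush ldmd.log |
        awk '{ n++; s += $NF; if ($NF > max) max = $NF }
             END { if (n) printf "%d flushes, avg %.1f s, max %.1f s\n", n, s/n, max }'
    ---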
> > > >
> > > > Robb...
> > > > >
> > > > > Thanks,
> > > > > Gregg Grosshans
> > > > > SPC
> > > > >
> > > > > ---information from ldmd.conf---------
> > > > > pqexpire -a .23333
> > > > > request FSL5 (^[A-FK-X]) (hostname was here)
> > > > >
> > > > > ---information from ldmadmin---------
> > > > > $pq_size = 750000000;
> > > > >
> > > > >
> > > > > ------portion of ldmd.log-------------
> > > > >
> > > > > Mar 30 15:07:24 spock pqact[14333]: pbuf_flush 15: time elapsed 5.690810
> > > > > Mar 30 15:07:24 spock DCMETR[10991]: Normal Termination
> > > > > Mar 30 15:07:24 spock DCMETR[11773]: Starting Up
> > > > > Mar 30 15:07:31 spock pqact[14333]: pbuf_flush 15: time elapsed 6.166595
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > ===============================================================================
> > > > Robb Kambic              Unidata Program Center
> > > > Software Engineer III    Univ. Corp for Atmospheric Research
> > > > address@hidden WWW: http://www.unidata.ucar.edu/
> > > > ===============================================================================
> > >
> >
> > ===============================================================================
> > Robb Kambic              Unidata Program Center
> > Software Engineer III    Univ. Corp for Atmospheric Research
> > address@hidden WWW: http://www.unidata.ucar.edu/
> > ===============================================================================
>
===============================================================================
Robb Kambic Unidata Program Center
Software Engineer III Univ. Corp for Atmospheric Research
address@hidden WWW: http://www.unidata.ucar.edu/
===============================================================================