20030806: 20030723: Problem with LDM/NOAAport ingestor
- Subject: 20030806: 20030723: Problem with LDM/NOAAport ingestor
- Date: Wed, 06 Aug 2003 10:58:17 -0600
Kevin,
The card we are using has either 8MB or 32MB of RAM on board,
so it's a little difficult to compare performance.
The big performance hit is calculating the MD5 for large products.
If you compute the MD5 along the way as you receive each 5KB part,
then you will avoid the big delay of calling the MD5 computation after you
have the entire product (really important with 26MB images....but even
100KB products in the NWSTG channel would benefit).
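In sketch form, the incremental hashing Chiz describes looks like the
following; OpenSSL's MD5 routines are used here for illustration (the LDM
ships its own MD5 code with the same Init/Update/Final shape), and
get_next_block() is a hypothetical stand-in for the frame reader:

    #include <openssl/md5.h>
    #include <sys/types.h>

    extern ssize_t get_next_block(unsigned char *buf, size_t len);
                                     /* hypothetical frame reader */

    /* Fold each arriving ~5KB block into the digest as it lands, so no
     * large hashing pass is needed once the product is complete. */
    void hash_product_incrementally(void)
    {
        MD5_CTX ctx;
        unsigned char digest[MD5_DIGEST_LENGTH];
        unsigned char block[5120];
        ssize_t n;

        MD5_Init(&ctx);
        while ((n = get_next_block(block, sizeof block)) > 0)
            MD5_Update(&ctx, block, (size_t)n);
        MD5_Final(digest, &ctx);  /* ready the instant the last block lands */
    }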
Steve Chiswell
>From: "Kevin R. Tyle" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200308061422.h76EM4Ld008336
>Hi Steve,
>
>An update: I rewrote the data ingestor so it reads in each
>frame and sends it directly to the LDM product queue, passing an
>LDM product structure to the pq_insert function. I still see the
>same frame loss crop up, especially once the queue fills up and
>the "self-cleaning" process begins. It appears that the time
>required for pq_insert to return delays things just long enough
>for the program to miss frames when the next call to "recvfrom"
>retrieves the next frame from the card, although the timestamps
>when pq_insert is enabled do not seem to show much change.
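The shape of the loop Kevin describes is roughly the following; pq_insert()
and the product structure come from the LDM's pq.h/ldm.h, sock stands for
the descriptor on the Cyclades card, and assemble() is a hypothetical
stand-in for the header-stripping and product-assembly step:

    #include <sys/socket.h>
    #include <sys/types.h>
    /* #include "ldm.h" and "pq.h" for product, pqueue, pq_insert() */

    extern int sock;                  /* descriptor for the card */
    extern pqueue *pq;                /* queue opened earlier with pq_open() */
    extern int assemble(unsigned char *frame, ssize_t n, product *prod);
                                      /* hypothetical: strips headers, fills
                                         in the product; nonzero when whole */

    void ingest_loop(void)
    {
        unsigned char frame[5120];    /* one HDLC frame */
        product prod;

        for (;;) {
            ssize_t n = recvfrom(sock, frame, sizeof frame, 0, NULL, NULL);
            if (n <= 0)
                continue;             /* real code logs and recovers */
            if (assemble(frame, n, &prod) &&
                pq_insert(pq, &prod) != 0)
                ;                     /* handle full queue / duplicate */
        }
    }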
>
>Anything else you could recommend I check? I have an email in
>to Cyclades asking whether I am actually using the buffer on the card.
>The RAM buffer is 256KB, or about 64 frames.
>
>Thanks . . .
>
>Kevin
>______________________________________________________________________
>Kevin Tyle, Systems Administrator **********************
>Dept. of Earth & Atmospheric Sciences address@hidden
>University at Albany, ES-235 518-442-4571 (voice)
>1400 Washington Avenue 518-442-5825 (fax)
>Albany, NY 12222 **********************
>______________________________________________________________________
>On Fri, 25 Jul 2003, Kevin R. Tyle wrote:
>
>> Hi Steve,
>>
>> comments below . . .
>>
>> On Thu, 24 Jul 2003, Unidata Support wrote:
>>
>> >
>> > Kevin,
>> >
>> > The 4 channel system I wrote reads the incoming data, computes the
>> > MD5 checksum as the data streams in, and then inserts into the queue directly.
>> > This avoids other processes, named pipes, and the like. It also
>> > allows the MD5 to be computed as the data blocks arrive, rather than
>> > waiting for the entire product to arrive and then have
>> > pqing compute the checksum...which is much more important with
>> > 26MB satellite images. Also, your PC card probably has a RAM
>> > buffer on it- so if necessary, your card will provide the
>> > buffer space.
>> >
>> > Some points here you may want to consider:
>> >
>> > 1) you didn't say it- but I'm assuming you are using pqing to read from
>> > your named pipes. It sounds like your named pipe would have to be full
>> > in order to drop something. Is your program checking for this
>> > condition? How do you handle it....or do things get dropped on the floor?
>> >
>>
>> Yep, three separate instances of pqing are launched when the ldm starts,
>> and they read from the DDPLUS, HDS, and NMC3 named pipes. At this point,
>> I'm not checking for a full pipe (how is this done, anyway?), but it
>> does seem that when the data is written to the pipe, it all gets ingested
>> and properly handled by pqact. The check for sequential HDLC frame numbers
>> is done when the HDLC frame is read in from the Cyclades card. Based on
>> debugging output, it appears that the frames are missed entirely and thus
>> never get a chance to be processed and written to the pipe. Running the
>> ingestor without the LDM, with or without output to a FIFO, shows no
>> frame loss.
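On "how is this done": if the writer opens the FIFO with O_NONBLOCK, a
write to a full pipe fails immediately with EAGAIN instead of blocking, so
the condition is at least visible. A minimal sketch (the FIFO path is
invented for illustration):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Note: O_WRONLY|O_NONBLOCK on a FIFO fails with ENXIO until a
     * reader (here, pqing) has the other end open. */
    int open_feed_fifo(void)
    {
        return open("/usr/local/ldm/etc/DDPLUS.fifo", O_WRONLY | O_NONBLOCK);
    }

    int write_frame(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EAGAIN) {
            /* Pipe full: pqing has fallen behind.  Count it, buffer it,
             * or drop it -- but deliberately, not silently. */
            return -1;
        }
        return (int)n;
    }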
>>
>> > 2) You generally don't want a program being dependent on something else
>> > without buffering. In your approach, you are losing the benefit of the
>> > on-board memory of the card. Your buffer in the pipe is probably limited.
>> > One alternative is to have your program write to a cyclical file
>> > (buffer), and have a separate process read from the cyclical file and feed the
>> > FIFO....but you would still need to be checking for write errors.
>> >
>> >
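A bare-bones sketch of that cyclical-file idea: one process appends and
wraps at a fixed size, a separate process chases it and feeds the FIFO.
The ring size is an assumption, and the reader-side bookkeeping (knowing
where the newest data ends) is the real work and is elided here:

    #include <sys/types.h>
    #include <unistd.h>

    #define RING_SIZE (8 * 1024 * 1024)   /* sized to taste */

    static off_t wpos;                    /* writer's offset in the ring file */

    /* Append one record, wrapping at RING_SIZE.  A real version also
     * persists wpos (e.g. in a header block) so the reader process can
     * find the write frontier. */
    void ring_write(int ringfd, const void *buf, size_t len)
    {
        if (wpos + (off_t)len > RING_SIZE)
            wpos = 0;
        pwrite(ringfd, buf, len, wpos);   /* return value checked in real code */
        wpos += len;
    }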
>>
>> I am going to follow what I think you did by modifying our program to
>> directly access and write to the queue, eliminating the use of named
>> pipes. My plan is to incorporate the relevant parts of
>> pqing into the program, and we'll see how it goes.
>>
>> I believe the size of a named pipe is limited to 4096 bytes at the
>> kernel level. Although each HDLC frame appears to be less than
>> that (at least on the NWSTG channel), a full product clearly
>> will often exceed this. But if the problem were with the pipe,
>> wouldn't I see products failing to make it through the LDM? I'm not
>> fully aware of how the named pipes work, but maybe things are delayed
>> while the LDM reads data out of the pipe just long enough for the
>> main program to lose incoming data from the card.
>>
>> Thanks for the advice . . .
>>
>> --Kevin
>>
>>
>>
>> > The LDM queue cleaning is generally efficient; pqexpire is much more costly
>> > since it has to search the queue. A fast machine should not be noticing
>> > that overhead. You probably want a larger queue anyhow, since a T-1 is capable
>> > of exceeding 400MB an hour.
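(For reference, that figure is conservative: a T-1 runs at 1.544 Mb/s, and
1.544e6 / 8 bytes/s x 3600 s is roughly 695 MB per hour before framing
overhead.)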
>> >
>> > Steve Chiswell
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > >From: "Kevin R. Tyle" <address@hidden>
>> > >Organization: UCAR/Unidata
>> > >Keywords: 200307232218.h6NMI8Ld008737
>> >
>> > >Hi,
>> > >
>> > >First, this question pertains to work that I am doing as a
>> > >consultant for a non-Unidata member (MESO, Inc.), so I understand this
>> > >might not be the right place to send this, but hey, it's an
>> > >interesting problem.
>> > >
>> > >MESO basically did a "do-it-yourself" installation of a NOAAport
>> > >system. Besides the appropriate satellite dish/EFR-54 Receiver system,
>> > >we use a 2.6 GHz dual CPU Intel P-4 that is running RH 8.0. A Cyclades
>> > >PC300 card is used on the PC to receive the data from the receiver.
>> > >Presently, we are only ingesting data on the NCEP/NWSTG channel.
>> > >The PC has three 36 GB SCSI disks and uses the EXT3 journaling
>> > >filesystem (although I have experimented with placing the LDM
>> > >product queue on its own disk, separate from the rest
>> > >of the LDM-related files, on a non-journaling ext2 filesystem).
>> > >
>> > >Our ingest program receives frames from the card, strips out the
>> > >extraneous headers, and basically puts everything into an
>> > >LDM-friendly format. Depending on the WMO ID, products are
>> > >separated into DDPLUS, HDS, and NMC3 feeds. The data is output
>> > >into three named pipes, corresponding to the three data feeds.
>> > >The LDM then reads from these named pipes.
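That fan-out amounts to a small switch on the WMO ID. In sketch form, with
the classifier and FIFO descriptors as hypothetical stand-ins:

    #include <stddef.h>
    #include <unistd.h>

    enum feed { FEED_DDPLUS, FEED_HDS, FEED_NMC3 };

    extern enum feed feed_for(const char *wmoid);  /* hypothetical lookup */
    extern int fd_ddplus, fd_hds, fd_nmc3;         /* FIFOs opened at startup */

    void route_product(const char *wmoid, const void *buf, size_t len)
    {
        int fd;
        switch (feed_for(wmoid)) {
        case FEED_DDPLUS: fd = fd_ddplus; break;
        case FEED_HDS:    fd = fd_hds;    break;
        default:          fd = fd_nmc3;   break;
        }
        write(fd, buf, len);    /* real code must check the return value */
    }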
>> > >
>> > >Basically, when I start the LDM, everything goes well, for a time.
>> > >All frames are received (we check for sequential frame #'s and
>> > >product ID's). But, after a certain period of time, say an
>> > >hour or so, we begin to lose frames. Sometimes a couple, sometimes
>> > >about 10 or so. And once it starts, it's basically useless until
>> > >the ingestor and LDM are restarted. If I run the ingestor without
>> > >the LDM (e.g., just cat'ing the named pipes into /dev/null), no
>> > >frame skipping occurs.
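(The frame check itself is a one-liner in spirit: remember the last ID and
complain when the new one isn't last-plus-one, allowing for whatever
wraparound the counter uses. Roughly:

    #include <stdio.h>

    static int prev_id = -1;

    void check_frame_seq(int frame_id)
    {
        if (prev_id >= 0 && frame_id != prev_id + 1)
            printf("*** BREAK IN FRAME # SEQUENCE!! (%d -> %d)\n",
                   prev_id, frame_id);
        prev_id = frame_id;
    }

The message format here is illustrative, not the program's actual output.)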
>> > >
>> > >I knew I was onto something when I found that when I remade the
>> > >queue, things would always work well for an hour or so. I began
>> > >to suspect that when the queue reached its full size, we started
>> > >to see the frame loss.
>> > >
>> > >Here is an example from today. I started the ingestor at
>> > >1845 UTC. All goes well for about 90 minutes. Then, I get
>> > >this in the output from the ingestor:
>> > >
>> > >WMOID = SPAK32, Cat. = 1,LDM sqnm = 688, feed = DDS,Product ID # = 940200
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 634, Current Frame ID = 635
>> > >
>> > >WMOID = UANT01, Cat. = 7,LDM sqnm = 689, feed = DDS,Product ID # = 940201
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 635, Current Frame ID = 636
>> > >
>> > >WMOID = SDUS23, Cat. = 1,LDM sqnm = 690, feed = RAD,Product ID # = 940202
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 636, Current Frame ID = 643
>> > >
>> > >*** BREAK IN FRAME # SEQUENCE!! ***
>> > >
>> > >WMOID = SDUS22, Cat. = 1,LDM sqnm = 691, feed = RAD,Product ID # = 940203
>> > >
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 643, Current Frame ID = 644
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 644, Current Frame ID = 645
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 645, Current Frame ID = 646
>> > >030723/20:22:45
>> > >
>> > >Previous Frame ID = 646, Current Frame ID = 647
>> > >
>> > >WMOID = SDUS51, Cat. = 1,LDM sqnm = 692, feed = RAD,Product ID # = 940205
>> > >
>> > >*** BREAK IN PRODUCT NUMBER SEQUENCE!! ***
>> > >
>> > >Now look at the pqmon output from about that time:
>> > >
>> > >Jul 23 20:22:29 lightning2 pqmon[15276]: 36025 1 61630 397335184
>> > >59835 4 37820 2667888 3169
>> > >Jul 23 20:23:29 lightning2 pqmon[15276]: 36018 1 61637 398231640
>> > >59835 4 37820 1771432 3169
>> > >Jul 23 20:24:29 lightning2 pqmon[15276]: 36027 1 61628 399151056
>> > >59835 4 37820 852016 3169
>> > >Jul 23 20:25:29 lightning2 pqmon[15276]: 36238 1 61417 399592472
>> > >59835 4 37820 410600 3168
>> > >Jul 23 20:26:29 lightning2 pqmon[15276]: 36216 1 61439 399991480
>> > >59835 4 37820 11592 3163
>> > >Jul 23 20:27:29 lightning2 pqmon[15276]: 36186 1 61469 399994920
>> > >59835 4 37820 8152 3147
>> > >Jul 23 20:28:29 lightning2 pqmon[15276]: 36057 1 61598 399999864
>> > >59835 4 37820 3208 3137
>> > >Jul 23 20:29:30 lightning2 pqmon[15276]: 35646 1 62009 399980208
>> > >59835 4 37820 22864 3124
>> > >Jul 23 20:30:30 lightning2 pqmon[15276]: 35283 1 62372 399994480
>> > >59835 4 37820 8592 3112
>> > >Jul 23 20:31:30 lightning2 pqmon[15276]: 35462 1 62193 400000696
>> > >59835 4 37820 2376 3120
>> > >Jul 23 20:32:30 lightning2 pqmon[15276]: 34906 1 62749 400000632
>> > >59835 4 37820 2440 3077
>> > >Jul 23 20:33:30 lightning2 pqmon[15276]: 34858 1 62797 399999192
>> > >59835 4 37820 3880 3057
>> > >Jul 23 20:34:30 lightning2 pqmon[15276]: 34290 1 63365 399996784
>> > >59835 4 37820 6288 2959
>> > >Jul 23 20:35:30 lightning2 pqmon[15276]: 33682 1 63973 399997352
>> > >59835 4 37820 5720 2885
>> > >Jul 23 20:36:30 lightning2 pqmon[15276]: 33596 1 64059 399998024
>> > >59835 4 37820 5048 2861
>> > >Jul 23 20:37:30 lightning2 pqmon[15276]: 32904 1 64751 399992952
>> > >59835 4 37820 10120 2805
>> > >Jul 23 20:38:30 lightning2 pqmon[15276]: 32380 1 65275 399999784
>> > >59835 4 37820 3288 2708
>> > >Jul 23 20:39:30 lightning2 pqmon[15276]: 32487 1 65168 399989456
>> > >59835 4 37820 13616 2706
>> > >Jul 23 20:40:30 lightning2 pqmon[15276]: 32657 1 64998 400001480
>> > >59835 4 37820 1592 2717
>> > >Jul 23 20:41:30 lightning2 pqmon[15276]: 32764 0 64892 400003072
>> > >59835 4 37820 0 2737
>> > >Jul 23 20:42:30 lightning2 pqmon[15276]: 32956 1 64699 399985120
>> > >59835 4 37820 17952 2752
>> > >
>> > >The queue is just about filled up by 20:22, and that's when we see the
>> > >problems start.
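(Reading across the pqmon lines: the fourth and eighth numbers sum to the
same 400003072 in every row, so they appear to be bytes in use and bytes
free; the free figure collapses from about 2.7 MB at 20:22 to essentially
zero by 20:41.)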
>> > >
>> > >I experimented with running pqexpire at 30-second intervals
>> > >to keep only the last 30 minutes of data. That cleared the
>> > >queue, but I then found that each run of pqexpire corresponded,
>> > >almost to the second, with frame-loss errors in the ingestor program.
>> > >
>> > >So it seems to me that the product queue cleanup process, whether it
>> > >is run "automatically" in the modern LDM or "the old way" using
>> > >pqexpire, slows pqing's reading from the named pipes just enough
>> > >that it can't keep up with the main ingestor program. By the time
>> > >the data is read from the pipe, some frames have lost their
>> > >"window of opportunity" to get ingested.
>> > >
>> > >Any ideas as to how I might be able to solve this problem
>> > >would be much appreciated. I am sure that this has to have
>> > >been done before by the outfits that use a Linux box to
>> > >ingest data via the LDM.
>> > >
>> > >For what it's worth, we have the same problem on a much older
>> > >PIII 600 MHz system running RH 6.1.
>> > >
>> > >Many thanks . . .
>> > >
>> > >--Kevin
>> > >
>> > >______________________________________________________________________
>> > >Kevin Tyle, Systems Administrator **********************
>> > >Dept. of Earth & Atmospheric Sciences address@hidden
>> > >University at Albany, ES-235 518-442-4571 (voice)
>> > >1400 Washington Avenue 518-442-5825 (fax)
>> > >Albany, NY 12222 **********************
>> > >______________________________________________________________________
>> > >
>> >
>> > **************************************************************************
>> > Unidata User Support                          UCAR Unidata Program Center
>> > (303)497-8643                                               P.O. Box 3000
>> > address@hidden                                          Boulder, CO 80307
>> > --------------------------------------------------------------------------
>> > Unidata WWW Service              http://my.unidata.ucar.edu/content/support
>> > **************************************************************************
>> >
>>
>