- Subject: [TIGGE #LGY-600646]: Re: Missing fields from CMA
- Date: Tue, 24 Apr 2007 00:12:41 -0600
YangXin, Manuel, Baudouin, and Doug:
The following is a bulleted outline of the discussions held here at Unidata
last Friday and today concerning the data reception efforts at the CMA.
Unidata TIGGE Data Relay Review 20070420
I. Comments on CMA setup from Baudouin, YangXin, etc.:
1. Port 388 traffic is not volume limited (commonly known as packet shaping)
by CMA
2. FTP has been used to send large volumes of data from CMA to ECMWF
3. There are two large users of the 100 Mbps link from CMA to CSTNET, those
being FTP and LDM. FTP uses 0.75 GB/hr for 2 hours 4 times a day.
II. What we have learned from our investigations:
1. CMA LDM/IDD system setup:
* 3 GB LDM queue which holds approx. 400 seconds of data
* 4 GB RAM
* 4 x 3 GHz dual-core, 64-bit Xeon processors
* RedHat Enterprise 4.0, 2.6.9-42 kernel
* 11 redundant feed requests (10 + one for manifest) to 2 sites,
ECMWF and NCAR
o ECMF and EGRR products
* 11 feed requests to NCAR by IP address
o KWBC products
* high number of Perl "resend" scripts (up to 80 running
concurrently) invoked from pqact on receipt of "missing" product
identifiers sent by ECMWF. Each of these invokes pqinsert,
increasing the number of queue write locks and decreasing the
residency time of real-time data (see the configuration sketch
after this section).
2. ECMWF was pacing BABJ data (FTPed from CMA) into their LDM queue to
backfill the archive
3. Port 388 traffic is being packet shaped
* we are able to transfer 30-40 times as much data using port 8080
(NB: we don't know whether port 8080 traffic is also being shaped;
if it is, the shaping is less strict than that applied to port 388)
4. Data processing is being done on LDM relay machines at all TIGGE centers
5. On various days at various times we have seen high packet loss on the
CMA-to-NCAR link (iperf tests)
6. CMA can receive ECMWF products faster from NCAR than directly from ECMWF
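For reference, the setup described in II.1 would look roughly like the
following ldmd.conf and pqact.conf fragments. This is only a sketch: the
feedtype (EXP), product-ID patterns, hostnames, and resend script path are
assumptions for illustration, not the actual CMA configuration.

   # ldmd.conf (sketch): redundant requests for the same products sent to
   # two upstream sites; patterns and hosts below are placeholders
   REQUEST EXP "^(ecmf|egrr)" tigge-ldm.ecmwf.example
   REQUEST EXP "^(ecmf|egrr)" tigge-ldm.ucar.example
   REQUEST EXP "^kwbc"        192.0.2.10   # NCAR host requested by IP address

   # pqact.conf (sketch): a "missing"-product notice triggers a Perl resend
   # script, which in turn runs pqinsert into the local queue (fields in a
   # real pqact.conf entry must be tab-separated)
   EXP  ^missing_(.*)
        EXEC    /usr/local/ldm/util/resend.pl \1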
III. What we need to help troubleshoot data transfers to/from CMA
1. Bandwidth usage statistics (time series plots of network bandwidth use)
* the GLORIAD website (http://www.gloriad.org) has some usage plots, but
much of the website has not been updated since 2004, so the information
is suspect
2. Other (list will likely evolve as we learn more)
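Until better bandwidth statistics are available, the kinds of spot checks
referred to in II.3 and II.5 can be repeated at any time with iperf. A
minimal sketch (hostnames are placeholders; ports, durations, and rates
would need to match the local setup):

   # on the receiving end (e.g., at NCAR), start an iperf server:
   iperf -s -p 388                      # repeat with -p 8080 for comparison

   # on the sending end (e.g., at CMA), run timed TCP tests on both ports:
   iperf -c receiver.example.edu -p 388  -t 60 -i 10
   iperf -c receiver.example.edu -p 8080 -t 60 -i 10

   # a fixed-rate UDP test reports packet loss directly:
   iperf -s -u -p 5001                              # receiving end
   iperf -c receiver.example.edu -u -b 10M -t 60    # sending end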
IV. Implications of what has been observed
1. Small LDM product queue on CMA machine:
* lots of 2nd-trip products (products ingested more than once) received
from ECMWF: it is likely that the upstream LDM processes at ECMWF
are reading products from the older end of the queue because those
processes are using the ALTERNATE transfer-mode. The downstream LDM
processes at CMA are probably requesting those products again because
of the small product queue at CMA. This is the likely cause of
CMA appearing to receive more data than ECMWF injects (see the
residency calculation after this section).
2. Packet Shaping on Port 388:
* LDM cannot be used out-of-the-box
* the shaping could be extended (by whom?) to other ports as high usage
continues (according to the GLORIAD website, the TIGGE data movement
dominates the traffic on GLORIAD links)
3. Pacing of CMA data into ECMWF LDM queue:
* CMA data lulls (periods when no data is received at the CMA) were likely
caused by there being no new data inserted into the ECMWF queue that
would otherwise flow to CMA (i.e., CMA requests for data would, and
should, not include CMA's own (aka BABJ) data)
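As a rough illustration of the queue-residency problem in IV.1 (a
back-of-the-envelope calculation only, using the approximate figures from
II.1):

   3 GB queue / ~400 seconds of residency  =>  ~7.5 MB/s being inserted
   ~7.5 MB/s                               =>  ~27 GB/hour through the queue

At that same insertion rate, the 12 GB queue recommended in V below would
hold roughly 12 GB / 7.5 MB/s = ~1600 seconds (about 27 minutes) of data,
giving downstream LDMs and resend processing far more time before products
age out of the queue.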
V. UPC Recommendations
1. CMA
* increase system RAM to _at least_ 16 GB
* increase LDM queue size to _at least_ 12 GB (dependent on addition of
RAM; see the pqcreate sketch at the end of this section)
* remove redundant feed requests (at least until LDM queue is large
enough to detect and reject redundant data)
* contact CSTNET to have the packet shaping on port 388 located and removed
* set up a connection between Unidata and CSTNET
* run UPC's "uptime" script so that we can view time-series plots of
operational parameters (we can install this monitoring tool whenever
permission is granted)
* possibility: do not attempt to process "missing" requests from other
TIGGE centers (currently handled by invoking a Perl script from pqact to
pace the re-requested data into the local queue, which also shortens the
residency time of products in the queue) until the data flow issues have
been resolved
2. ECMWF and NCAR
* increase LDM queue size to at least 12 GB
o this will require in-depth investigation of problems seen at ECMWF
when using a queue larger than 4 GB
* install a development system on the ECMWF machine currently being used
to ingest and relay data
If memory serves correctly, the size of the LDM product queue used during
the initial throughput testing from ECMWF to NCAR in January of 2006 was
much larger than what is currently being used at ECMWF. Since the operating
system on the current ECMWF LDM/IDD machine is the same as what was being
used during the original tests, and since the configuration of the machine
(RAM, etc.) is more-or-less the same as that of the machine used for the
original tests, we are at a loss as to why the current machine performs so
poorly with LDM queues greater than 4 GB. Having the ability to build the
LDM from source on the machine running the relay would be an invaluable
tool for troubleshooting the small-queue restriction.
* agree that only one site (ECMWF or NCAR, not both) should request data
from the CMA --- since CMA's queue only holds approx. 400 seconds of data
and port 388 throughput is limited, it is unlikely that either center will
be able to transfer the data successfully if both are requesting it
3. All centers
* offload data processing to machine(s) other than the one(s) being used
for data transfers
* separate ingest and send processing (LDM queue residency time for
ingested data is being affected by insertion of local products)
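For the queue-size recommendations above, the enlarged queue would be created
with the LDM's pqcreate utility while the LDM is stopped. A sketch only: the
queue path and slot count below are placeholders that would need to be
adjusted for the local installation (normally the size is set in the ldmadmin
configuration so that "ldmadmin mkqueue" picks it up):

   ldmadmin stop
   pqcreate -s 12G -S 600000 -q /usr/local/ldm/var/queues/ldm.pq
   ldmadmin start
   # -s: queue size (use a byte count if this pqcreate does not accept "12G")
   # -S: number of product slots (placeholder value; tune for product mix)

The slot count matters as much as the byte size for TIGGE's mix of many small
products: with too few slots, products are deleted from the queue before its
byte capacity is reached.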
VI. Future Considerations
1. TIGGE topology review and possible redesign to accommodate more
participants (e.g., Australian BoM, Brazil's INPE/CPTEC, etc.)
2. TIGGE product resend review and possible redesign
VII. Comments
1. Problems seen at CMA are likely caused by:
* attempts to ingest data through a port whose traffic is being artificially
limited (packet shaped). The successful demonstration of ingesting through
port 8080 reinforces the notion that efforts must be made to locate the
source of the packet shaping and lobby to have it removed.
* attempts to ingest high volumes of data into a too-small LDM product
queue while inserting local data into the same queue for transmission to
downstream sites; this is most likely the cause of the 2nd-trip products
being received
2. As more sites participate in TIGGE, the problems currently being seen will
likely grow. Segregating ingest activities (product requests from upstream
sites) from feed activities (insertion of local products into the LDM queue
for transmission to downstream sites) should mitigate these problems; one
possible arrangement is sketched below.
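One way to realize that segregation (a sketch only; the hostnames, feedtype,
and patterns are placeholders, and other arrangements are possible) is to run
pqinsert for local products on a separate ingest host and have the relay
request them like any other upstream feed, so that local insertion no longer
competes for write locks on the relay's queue:

   # on a local ingest host (e.g., ingest.cma.example): insert BABJ products
   pqinsert -f EXP -p "babj_<product-id>" <file>

   # in the relay host's ldmd.conf: request the local products from the
   # ingest host alongside the requests to the other TIGGE centers
   REQUEST EXP "^babj" ingest.cma.example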
We welcome discussion of any/all of the comments made above. Please let us
know if you would like a detailed explanation of any comment or recommendation
made.
Cheers,
Tom
****************************************************************************
Unidata User Support UCAR Unidata Program
(303) 497-8642 P.O. Box 3000
address@hidden Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage http://www.unidata.ucar.edu
****************************************************************************
Ticket Details
===================
Ticket ID: LGY-600646
Department: Support IDD TIGGE
Priority: Normal
Status: Open