This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Unidata Support wrote:

> ------- Forwarded Message
>
> To: Unidata Support <address@hidden>
> From: David Wojtowicz <address@hidden>
> Subject: Re: 20020206: large CONDUIT latencies at UIUC
> Organization: UCAR/Unidata
> Keywords: 200202081436.g18Ea8x27570
>
> We're still experiencing very large peaks in latencies on NMC2
> to flood.atmos.uiuc.edu.
>
> I am wondering if the machine itself could be the bottleneck. The load
> average gets fairly high during portions of the day (9+!) and never
> drops below about 3. The load appears to be entirely due to LDM relay
> activity. (It does not run pqact or any other time-consuming process,
> and the load drops to near zero when the LDM is stopped.) It is
> servicing several NMC2 and a good number of NNEXRAD|FNEXRAD downstream
> requests. It is a 400 MHz PC running Linux with 512 MB of memory,
> dedicated only to LDM relay.
>
> I've seen somewhat high loads on other machines running the LDM and
> servicing lots of requests. How can one tell at which point the machine
> has become the bottleneck? And if it has, what is the recommended
> capacity to handle this better?
>
> --
> | David Wojtowicz, Sr. Research Programmer
> | Department of Atmospheric Sciences Computer Services
> | University of Illinois at Urbana-Champaign
> | email: address@hidden   phone: (217)333-8390
>
> ------- End of Forwarded Message

Hi David,

That does sound like a high load, especially for a relay-only machine. We've always considered rpc.ldmd to be relatively easy on the CPU, as opposed to pqact and any decoders that are also running. But CONDUIT is a bear. How many sites are actually connecting to you? Of those, how many are requesting all or part of CONDUIT? And how many rpc.ldmds are running?

Here at Unidata, I can see how timely products are arriving at your downstream sites by running 'notifyme' against them (an example invocation is shown below). For example, according to our site contact list (which isn't always accurate), I see that climate.geog.udel.edu should be feeding from you. Here are a few products as they arrive at squall:

Feb 08 23:48:21 notifyme[15092]: 177 20020208234820.046 IDS|DDPLUS 448 SRUS56 KWOH 082341 /pRRSMRX
Feb 08 23:48:21 notifyme[15092]: 1052 20020208234820.048 IDS|DDPLUS 449 SRUS74 KWOH 082341 /pRRSOUN
Feb 08 23:48:21 notifyme[15092]: 252 20020208234820.050 IDS|DDPLUS 450 SRUS70 KWOH 082341 /pRRSLIX
Feb 08 23:48:21 notifyme[15092]: 177 20020208234820.052 IDS|DDPLUS 451 SXUS44 KWOH 082341 /pRRSOAX

and as they arrive at climate.geog.udel.edu:

Feb 08 23:48:21 notifyme[15091]: 177 20020208234820.046 IDS|DDPLUS 448 SRUS56 KWOH 082341 /pRRSMRX
Feb 08 23:48:21 notifyme[15091]: 1052 20020208234820.048 IDS|DDPLUS 449 SRUS74 KWOH 082341 /pRRSOUN
Feb 08 23:48:22 notifyme[15091]: 252 20020208234820.050 IDS|DDPLUS 450 SRUS70 KWOH 082341 /pRRSLIX
Feb 08 23:48:22 notifyme[15091]: 177 20020208234820.052 IDS|DDPLUS 451 SXUS44 KWOH 082341 /pRRSOAX

(Interesting that the PIDs are within 1 of each other.)

Assuming both squall's and udel's clocks are accurate, there is at most a one-second delay between the two. Assuming that all of squall's rpc.ldmds are getting equal resources, it doesn't look like squall is a bottleneck, at least not now while no CONDUIT data is arriving. I could find out when the various CONDUIT surges occur and do some more testing if you like. Of course, this approach includes the time for a product to travel across the network, which should not be counted as part of squall's contribution to a product's latency per se.
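A check like the one above can be reproduced with the LDM 'notifyme' utility from any host the two LDMs will accept connections from. Something along these lines should work; the feed type, the one-hour look-back, and the short host name "squall" (which would need to be fully qualified in practice) are only placeholders for whatever you actually want to watch:

  # list recent IDS|DDPLUS products as the upstream relay sees them
  notifyme -vl- -h squall -f 'IDS|DDPLUS' -o 3600

  # in a second window, list the same feed as the downstream site sees it
  notifyme -vl- -h climate.geog.udel.edu -f 'IDS|DDPLUS' -o 3600

Lining the two verbose listings up product by product, as in the excerpts above, shows how much later the downstream host is seeing each product.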
Or, our stats pages give some indication of how things are going. Here's the page for FOS routing:

http://www.unidata.ucar.edu/projects/idd/status/idd/fosTopo.html

There are similar pages for MCIDAS and a few others, but not CONDUIT. Still, I think that CONDUIT handling would impact the other feeds to some degree. From the FOS page, the delta between squall's latencies and the downstream latencies is pretty small, except in a few cases. Since it's only a few cases of bad latencies, that implies the connection is the more likely culprit. Note that these statistics can be an hour old.

What I think you're asking is how to measure the delay that is being accrued at your machine, and only at your machine. I think that would be possible except for a bug in rpc.ldmd. On my own machine, I put my inbound rpc.ldmd in debug mode, which lists the arrival time and signature of every product. I then tried to put an outbound feed that relays the same products in debug mode as well, but that regularly crashes. (This is a known bug. I think the buffer in which the output is placed is too small, resulting in a segmentation violation. Hopefully it will be fixed in 5.1.5.) If I could do that successfully, I could track how long it takes a particular product to be relayed to all downstream sites. To my knowledge we've never done such a calculation, but in theory, anyway, it could be done.

Regarding recommended capacity, we make some general recommendations about memory, disk speed, and CPU speed based on how much data is expected to pass through a site and the number of downstream sites. We identify sites that are overloaded by "unsatisfactory" latencies at their downstream sites, or by their administrators having trouble with responsiveness. Are you concerned that you're not serving your downstream sites properly? Or are there things you would like squall to be able to do that it can't? Is it not responsive enough? I can talk with our sysadmin about possible improvements in the OS configuration. Or, if the burden is too high, we can try reshuffling the topology.

Also, we have had some success with CONDUIT latencies in particular by splitting the feed up into multiple connections (a sketch of what the split request lines look like follows at the end of this message). The cost is more rpc.ldmds on both the sending and receiving ends, so it affects other sites besides your own. But the net effect was a big improvement. This could be a possibility for you.

Let me know if you'd like to pursue this further, or if you have any further questions. It was a pleasure to meet you at AMS and to learn how to say your name!

Anne

--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                Boulder, CO 80307
----------------------------------------------------
Unidata WWW server      http://www.unidata.ucar.edu/
****************************************************
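For reference, splitting a CONDUIT request across several connections is done with multiple request lines in the downstream site's ldmd.conf, each matching a different slice of the feed. The sketch below is only illustrative: it assumes the downstream is requesting from flood.atmos.uiuc.edu and that the split is keyed on the sequence number that ends each CONDUIT product identifier, which is the usual convention; the exact patterns and host would have to match your own setup.

  # instead of one request line carrying all of CONDUIT (NMC2), e.g.
  #   request CONDUIT ".*" flood.atmos.uiuc.edu
  # the feed is requested over five connections, keyed on the trailing
  # sequence digit of each product identifier:
  request CONDUIT "[09]$" flood.atmos.uiuc.edu
  request CONDUIT "[18]$" flood.atmos.uiuc.edu
  request CONDUIT "[27]$" flood.atmos.uiuc.edu
  request CONDUIT "[36]$" flood.atmos.uiuc.edu
  request CONDUIT "[45]$" flood.atmos.uiuc.edu

Each request line gets its own rpc.ldmd on both ends (the cost mentioned above), but each connection then only has to keep up with roughly a fifth of the volume. Note that some LDM versions combine multiple requests to the same host name into a single connection; in that case, separate aliases for the upstream host are needed to force separate connections.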