This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi Doug,

Mike is on travel at the moment, so I will jump in and try to answer your question.

re:
> Disk I/O seems to be one of the larger demands when memory is
> inadequate.

Yes. That is why we recommend significantly more memory than will hold the size of the LDM queue. On our idd.unidata.ucar.edu cluster we typically run a 10-12 GB queue on machines that have 16 GB of physical RAM.

> Currently I've got the queue set at 8 GB, with 16 GB of physical
> memory available on our SUN system.

That should be adequate, but it will depend on what processing other than the LDM is being done.

> Products (individual GRIB-2 messages) are processed out of the queue
> to local disk, then larger files are built from these pieces using the
> cat command from local disk to the SAN disk. When the queue is set
> larger, pqact processing gets way behind. Is this because swap memory
> needs to be used for the queue?

It is possible that you are seeing a swap-related issue when the queue is set to be larger than 8 GB. This would be my best guess as to what is occurring _if_ all other parameters were kept fixed (meaning that you did not change your processing) when you changed the size of your LDM queue.

I am not sure whether Mike (or someone else here, including me) has mentioned to you and the other TIGGE folks that we migrated to a Web100 kernel on our Linux LDM/IDD platforms. The Web100 kernel provides dynamic management of TCP buffer sizes, so that more space is allocated to the connections that need it AND less to those that do not. This feature allows us to run our idd.unidata.ucar.edu cluster nodes with significantly larger LDM queues than were possible with the stock kernel in the Fedora Core releases. Mike told me, however, that he read that the Web100 tuning capability has been built into the most recent Fedora Core distributions, so switching to a Web100-specific kernel may no longer be necessary (a quick way to check on a stock kernel is sketched below). Please note that we cannot say whether this is the case in distributions other than Fedora Core, so the comment about Web100 tuning may not apply to SuSE (or Debian, Ubuntu, Slackware, Knoppix, Gentoo, etc.) distributions.

One thing I did not notice in Mike's system recommendations to you was the need to purchase 64-bit systems. This may seem obvious when talking about 32 GB of system RAM and a 20 GB LDM queue, but I want to make sure that we are all aware of this requirement.

Finally, we have been having discussions here at the UPC about the impact of the number of processors that an LDM ingest system has. While it might seem that more processors are better, there is an issue with LDM queue write-lock contention. Our experience on our old Sun SPARC IDD top-level relay node, thelma.ucar.edu, hinted that increasing the number of processors was not necessarily a good thing. Experience on that machine also showed that it was advantageous to put "collector" machines in front of the relay nodes, so that the front-end machines make the requests for data from the upstream sources and the back-ends do the data relay. We use two front-end machines to redundantly collect the data from upstream IDD nodes and then feed the data over single connections to each of our cluster back-end machines. This cuts the number of processes contending for write locks from about 17-20 down to 2.
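On the TCP buffer question: recent mainline Linux kernels reportedly provide the same kind of dynamic buffer management that Web100 offered. As a rough sketch only (the sysctl names are standard, but the example limits are illustrative, not our tuning), one can check and adjust this with:

  # Is receive-buffer autotuning enabled? (1 = yes)
  sysctl net.ipv4.tcp_moderate_rcvbuf

  # Per-socket min/default/max buffer sizes, in bytes:
  sysctl net.ipv4.tcp_rmem
  sysctl net.ipv4.tcp_wmem

  # Illustrative limits that could be raised in /etc/sysctl.conf to give
  # autotuning more headroom on fast, high-latency paths:
  #   net.core.rmem_max = 16777216
  #   net.core.wmem_max = 16777216
  #   net.ipv4.tcp_rmem = 4096 87380 16777216
  #   net.ipv4.tcp_wmem = 4096 65536 16777216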
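To make the collector/back-end split concrete, here is a rough ldmd.conf sketch of that layout. The host names, feed types, and patterns are placeholders, not our actual idd.unidata.ucar.edu configuration:

  # On a "collector" (front-end) node: request redundantly from the
  # upstream sources, and allow only the cluster back-ends to feed from it.
  REQUEST ANY ".*" upstream-a.example.edu
  REQUEST ANY ".*" upstream-b.example.edu
  ALLOW   ANY ^backend[0-9]+\.example\.edu$

  # On each "back-end" relay node: request everything from the two
  # collectors only, so just two processes contend for queue write locks
  # no matter how many upstream sources the collectors themselves use.
  REQUEST ANY ".*" collector-1.example.edu
  REQUEST ANY ".*" collector-2.example.edu
  ALLOW   ANY ^.*\.example\.org$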
This configuration works very well for our cluster, where the back-ends can each support a large number of connections to downstream machines (we have seen over 200 on each node with no latency introduced by overload). Since the connection between the "collectors" and "back-ends" is Gbps, there is little, if any, need for large LDM queues on the collectors -- we typically run the collector machines with 2-6 GB queues, mainly because we want to be able to bring up new nodes, or restart existing nodes after system upgrades, etc., and have their queues populated with as close to an hour's worth of data as possible (roughly, the queue size needed is the average ingest rate times the desired residency time).

The point I am trying to make in the above somewhat rambling narrative is that we believe a single-machine approach for TIGGE may not scale as more sites start sending ensemble data to the archive centers. Our recommendation to the CMA to use more than one node of their cluster for TIGGE collection and dissemination activities is based on our own experience in creating our IDD relay cluster and on our attempt to predict the load that could/will be added to the nodes running at the TIGGE archive centers. It is my opinion that single-machine implementations are not likely to be sufficient for the work needed. We can/should discuss these issues at more length before you/ECMWF/CMA embark on the next round of equipment purchases.

Cheers,

Tom

****************************************************************************
Unidata User Support                              UCAR Unidata Program
(303) 497-8642                                    P.O. Box 3000
address@hidden                                    Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                                  http://www.unidata.ucar.edu
****************************************************************************

Ticket Details
===================
Ticket ID: XLU-757491
Department: Support IDD TIGGE
Priority: Normal
Status: Closed