This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
current configuration. sysrq is for our console "reset / debug" access to the box.

# increase the amount of memory associated with input and output socket buffers:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# recommended to increase this for 1000 BT or higher
net.core.netdev_max_backlog = 2500
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# Sys RQ
kernel.sysrq = 1
# LDM Tuning Setting by Chi
# Setting SHMMAX Parameter 4 GB
kernel.shmmax = 4294967296
# getconf PAGE_SIZE
kernel.shmmni = 4096
kernel.shmall = 2097152

Mike Schmidt wrote:
> Chi,
>
> Ideally, all of the LDM connections will stay in the ESTABLISHED state. When connections are in TIME_WAIT, they are in the process of closing down and disconnecting. If connections are continually in TIME_WAIT, that's usually an indication of an underlying problem.
>
> With the volume of data and the distance (latency) of the network connections you have between Illinois, Wisconsin, and Unidata, you will want to have done some TCP stack tuning. Here are the values we use on our cluster nodes:
>
> net.core.wmem_max = 8388608
> net.core.rmem_max = 8388608
> net.ipv4.tcp_wmem = 4096 2000000 8388608
> net.ipv4.tcp_rmem = 4096 524288 8388608
> net.ipv4.tcp_adv_win_scale = 7
> net.ipv4.tcp_moderate_rcvbuf = 1
>
> Let me know if you have questions.
>
> mike
>
> On Jun 21, 9:11am, Steve Chiswell wrote:
>
>> Subject: Re: Top level CONDUIT relay
>>
>> Chi,
>>
>> The LDM memory-maps the product queue. It does not use shared memory. I would guess that the shared memory segment you see in use belongs to the operating system or window manager.
>>
>> The parameter that can be tuned to improve LDM performance is the TCP stack size.
>>
>> The netstat listing that you showed with several LDM connections in TIME_WAIT may mean something to Steve Emmerson and/or Mike Schmidt, so I'll see if they have any input as well as suggestions on TCP tuning.
>>
>> Steve
>>
>> On Thu, 21 Jun 2007, Chi.Y.Kang wrote:
>>
>>> Yes, I made the change to the LDM servers to test the shared memory configuration.
>>>
>>> # Setting SHMMAX Parameter 4 GB
>>> kernel.shmmax = 4294967296
>>> # getconf PAGE_SIZE
>>> kernel.shmmni = 4096
>>> kernel.shmall = 2097152
>>>
>>> However, this doesn't explain the performance relief, because the LDM doesn't seem to be using shared memory, or at least it is not listed in the table. Mr. Cano thought the LDM might be using this.
>>>
>>> ldm1:~$ ipcs -a
>>>
>>> ------ Shared Memory Segments --------
>>> key         shmid   owner   perms   bytes   nattch   status
>>> 0x00000000  0       root    600     3976    4        dest
>>>
>>> ------ Semaphore Arrays --------
>>> key         semid   owner   perms   nsems
>>>
>>> ------ Message Queues --------
>>> key         msqid   owner   perms   used-bytes   messages
>>>
>>> Justin Cooke wrote:
>>>
>>>> Chi,
>>>>
>>>> Has anything at all changed on ldm1 since yesterday? Starting at 04Z the feed on node6 improved dramatically, and all other subscribers to ldm1 also noticed improved performance.
>>>>
>>>> Justin
>>>>
>>>> Steve Chiswell wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> I noticed that the feeds from ldm1 dropped as you said. Do you know if anything changed related to that machine?
>>>>>
>>>>> I can add daffy back to ldm1 and see if things maintain their performance, but will wait to find out if any changes were made. Since ldm2 is still lagging, it seems like it is not a network-wide issue.
>>>>>
>>>>> Steve
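The TCP tuning values quoted above are normally placed in /etc/sysctl.conf and loaded with sysctl. The following is a minimal sketch of applying and checking such settings on a Linux host; the values themselves come from the messages above and are not a one-size-fits-all recommendation, and the port-388 filter assumes the LDM is listening on its default port.

# after adding the desired settings to /etc/sysctl.conf, load them:
sysctl -p
# confirm the kernel picked up the new buffer limits:
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# check the state of LDM connections (default LDM port is 388); mostly ESTABLISHED
# is healthy, while persistent TIME_WAIT suggests connections are repeatedly closing:
netstat -tan | grep ':388'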
>>>>> On Thu, 21 Jun 2007, Justin Cooke wrote:
>>>>>
>>>>>> Steve,
>>>>>>
>>>>>> Looking at the graphs, it appears that transfers improved greatly after 04Z today. I did a netstat on ldm1 and I still see where atm and flood are subscribing to it, same as yesterday.
>>>>>>
>>>>>> Although, looking at the latency graphs you provide, it looks like those subscribing to ldm2 are still seeing delays.
>>>>>>
>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>
>>>>>> Justin
>>>>>>
>>>>>> Steve Chiswell wrote:
>>>>>>
>>>>>>> Justin,
>>>>>>>
>>>>>>> I am receiving the stats from node6:
>>>>>>>
>>>>>>> Latency:
>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>>
>>>>>>> Volume:
>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov
>>>>>>>
>>>>>>> The latency there to ldm1 is climbing on the initial connection, and it will start off by catching up on the last hour's worth of data in the upstream queue. After that, we can see what the latency is doing.
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>> On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:
>>>>>>>
>>>>>>>> Steve and Chi,
>>>>>>>>
>>>>>>>> I tried to ping rtstats.unidata.ucar.edu but was unable to.
>>>>>>>>
>>>>>>>> Chi, would you be able to set up a static route from node6 to rtstats.unidata.ucar.edu like Steve mentions?
>>>>>>>>
>>>>>>>> I actually am unable to connect to ncepldm.woc.noaa.gov either. However, I did set up a feed to "ldm1" and am receiving CONDUIT data currently.
>>>>>>>>
>>>>>>>> Steve, how tough would it be to do the pqact step you mention and to get the stats reports from those if Chi is unable to get the static route going?
>>>>>>>>
>>>>>>>> Thanks for all the help,
>>>>>>>>
>>>>>>>> Justin
>>>>>>>>
>>>>>>>> On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:
>>>>>>>>
>>>>>>>>> Justin,
>>>>>>>>>
>>>>>>>>> Is that box capable of sending stats to our rtstats.unidata.ucar.edu host? E.g., is it allowed to connect outside your domain?
>>>>>>>>>
>>>>>>>>> The LDM won't need to run pqact to test out the throughput and network, but it will need these ldmd.conf lines:
>>>>>>>>>
>>>>>>>>> EXEC "rtstats -h rtstats.unidata.ucar.edu"
>>>>>>>>> request CONDUIT ".*" ncepldm.woc.noaa.gov
>>>>>>>>>
>>>>>>>>> The pqact EXEC action can be commented out. The request line will start the feed to ncepldm, which flood.atmos.uiuc.edu is pointing to and showing high latency on. If you are able to feed from ncepldm without the latency that outside hosts are showing, then it would isolate the problem further, to the border of your network to the outside. If you do show similar latency, then it would be either the LDM configuration itself or the local router that the machines are on.
>>>>>>>>>
>>>>>>>>> If you are able to send rtstats out to us, then we can monitor the stats on our web pages. Your network might require a static route to be added for sending that outside your domain (that would be something your networking folks would know).
>>>>>>>>> The rtstats process sends a small text report about every 60 seconds, so it is not a lot of traffic.
>>>>>>>>>
>>>>>>>>> If you can't configure your host to send rtstats, then we could create a pqact.conf action to file the .status reports and calculate the latency from those.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>> On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:
>>>>>>>>>
>>>>>>>>>> Steve,
>>>>>>>>>>
>>>>>>>>>> If you provide us a pqact.conf, I can have the box Chi set up feed off of ldm1 and see how its latencies are.
>>>>>>>>>>
>>>>>>>>>> Justin
>>>>>>>>>>
>>>>>>>>>> On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:
>>>>>>>>>>
>>>>>>>>>>> Justin,
>>>>>>>>>>>
>>>>>>>>>>> Since the change at 13Z of dropping daffy.unidata.ucar.edu out of the top level nodes, the ldm2 feed to NSF is showing little/no latency at all. The ldm1 feed to NSF, which is connected using the alternate LDM mode, is only delivering the .status messages it creates, as all the other products are duplicates of products already being received from ldm2, and that is showing high latency:
>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>>>>>>
>>>>>>>>>>> This configuration is getting data out to the community at the moment. The downside here is that it puts a single point of failure at NSF in getting the data to Unidata, but I'll monitor that end.
>>>>>>>>>>>
>>>>>>>>>>> It seems that ldm1 is either slow, or it is showing network limitations (since flood.atmos.uiuc.edu is feeding from ncepldm, which is apparently pointing to ldm1, there is load on ldm1 besides the NSF feed). ldm2 is feeding both NSF and idd.aos.wisc.edu (and Wisconsin looks good since 13Z as well), so it is able to handle the throughput to 2 downstreams, but adding daffy as the 3rd seems to cross some point in the volume of what can be sent out.
>>>>>>>>>>>
>>>>>>>>>>> Steve
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Steve,
>>>>>>>>>>>>
>>>>>>>>>>>> Chi has set up a box on the LAN for us to run LDM on; I am beginning to get things running on there.
>>>>>>>>>>>>
>>>>>>>>>>>> Have you seen any improvement since dropping daffy?
>>>>>>>>>>>>
>>>>>>>>>>>> Justin
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Justin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, this does appear to be the case. I will drop daffy from feeding directly and instead move it to feed from NSF. That will remove one of the top level relays of data having to go out of NCEP, and we can see if the other nodes show an improvement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steve
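The fallback mentioned above, filing the CONDUIT .status reports with pqact instead of sending rtstats, could look roughly like the pqact.conf entry below. This is only a sketch: the product-ID pattern and the output path are illustrative and would need to be checked against the actual .status product identifiers, and the fields in pqact.conf must be separated by tabs.

# pqact.conf entry (illustrative): append each CONDUIT .status report to a
# local file so latencies can later be computed from the report contents.
CONDUIT	\.status
	FILE	-close	data/conduit_status_reports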
>>>>>>>>>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Steve,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites began making connections?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Chi, considering Steve saw a good connection to ldm1 before the other sites connected, doesn't that point toward a network issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All of our queue processing on the diskserver has been running without any problems, so I don't believe anything on that system would be impacting ldm1/ldm2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I set up the test LDM server for the NCEP folks to test the local pull from the LDM servers. That should give us some information on whether this is a network or system related issue. We'll handle that tomorrow. I am a little bit concerned that the slowdown all occurred at the same time as the ldm1 crash last week.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, can NCEP check if there are any bad dbnet queues on the backend servers? Just to verify.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks Justin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also had a typo in my message: ldm1 is running slower than ldm2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now, if the feed to ldm2 all of a sudden slows down when Pete and other sites add a request to it, it would really signal some sort of total bandwidth limitation on the I2 connection. It seemed a little coincidental that we had a short period of good connectivity to ldm1, after which it slowed way down.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just realized the issue. When I disabled the "pqact" process on ldm2 earlier today, it caused our monitor script (in cron, every 5 min) to kill the LDM and restart it. I have removed the check for pqact in that monitor... things should be a bit better now.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Huh, I thought you guys were on the system. Let me take a look on ldm2 and see what is going on.
>>>>>>>>>>>>>>>>>> Justin Cooke wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Pete and David,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata to request data from ldm1.woc.noaa.gov rather than ncepldm.woc.noaa.gov after seeing lots of disconnects/reconnects to the ncepldm virtual name.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The LDM appears to have caught up here as an interim solution.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Still don't know the cause of the problem.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service on the ldm2 box, where the VIP address is pointed to at this time. How is the current connection to ldm1? Is the speed of the CONDUIT feed acceptable?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all today. But looking at the logs, it appears to be dying and getting restarted by cron.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I will watch and see if I see anything.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Chi Y. Kang
>>>>>>>>>>>>>>> Contractor
>>>>>>>>>>>>>>> Principal Engineer
>>>>>>>>>>>>>>> Phone: 301-713-3333 x201
>>>>>>>>>>>>>>> Cell: 240-338-1059
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>>>>> Unidata
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Steve Chiswell <address@hidden>
>>>>>>>>> Unidata
>>>
>>> --
>>> Chi Y. Kang
>>> Contractor
>>> Principal Engineer
>>> Phone: 301-713-3333 x201
>>> Cell: 240-338-1059
>>
>> -- End of excerpt from Steve Chiswell
>
--
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059
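The monitor problem Justin describes above, a cron job that killed and restarted the LDM whenever a pqact process was not running, can be avoided by keying the check off the LDM system as a whole. The following is a minimal sketch, assuming the standard ldmadmin utility is on the LDM user's PATH and that its "isrunning" command exits zero when the LDM system is up; the log path is illustrative.

#!/bin/sh
# Cron check (illustrative): restart the LDM only if the LDM system itself
# is down, rather than restarting it whenever a pqact process is missing.
if ! ldmadmin isrunning >/dev/null 2>&1; then
    echo "$(date): LDM not running, restarting" >> /usr/local/ldm/logs/monitor.log
    ldmadmin start
fi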