On Thu, 31 May 2001, Arthur A. Person wrote:

> On Thu, 31 May 2001, Anne Wilson wrote:
>
> > "Arthur A. Person" wrote:
> >
> > > On Thu, 31 May 2001, Anne Wilson wrote:
> > >
> > > > And, what's the history on the queue size?
> > >
> > > I believe I started running the ldm with a queue size of 2GB around May 18, but with only a couple of rpc's in test mode. I then added my downstream sites at the end of last week, and over the weekend (Sunday) the system choked with the thrashing. I came in and power-cycled, re-made the queue at 300MB, and restarted in hopes I would get through the rest of the long weekend okay, and did. At that point my swap space was a 1.5GB partition, and I began thinking I perhaps needed swap space larger than my queue size if the queue is mapped, so I added a 2GB swap file to the system and then restarted the ldm on Tuesday with a re-made 2GB queue. This morning I noticed the system was thrashing again; I don't know exactly when it started.
> > >
> > > > Do you normally run with less than 300MB, and is that what you're doing now?
> > >
> > > I'm running with a 2GB queue now, which is what I want to run with. Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> >
> > Please see below for a comment about this.
> >
> > > > How many rpc.ldmd processes are currently running? (I hope it's responsive enough to tell.)
> > >
> > > Perhaps this is revealing... there's a bunch of rpc's running, I think more than there should be:
> >
> > Yes, this doesn't look right. Currently you have 78 of these processes running. That's 5 more than what you reported to me earlier, and 4 more than when I first logged in. The number seems to be growing.
> >
> > [ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
> >      79     789    6380
> >
> > The max number of rpc.ldmds you should have is the number of requests to "unique" hosts plus the number of allows to "unique" hosts plus one. (I qualify "unique" because, as you know, the LDM will group requests/allows to the same hosts unless you trick it by using the IP address.) You may have fewer rpc.ldmds if your upstream hosts are unavailable or your downstream sites are not connected. Anyway, you have way more than you should, based on the entries in your ldmd.conf:
> >
> > [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
> >       5      20     228
> > [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
> >      36     107    1669
> >
> > I'm developing a hypothesis: in looking at the PIDs of the running rpc.ldmds and comparing those with the PIDs listed in the log, it looks like sysu1.wsicorp.com is connecting a lot more than it's exiting. Take a look at this:
> >
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
> >     177    1416   12213
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
> >     121     726    5203
> > [ldm@ldm ~/logs]$ ^sysu1^windfall
> > grep windfall ldmd.log | grep -E "Exiting" | wc
> >      44     264    2024
> > [ldm@ldm ~/logs]$ ^Exiting^Connection from
> > grep windfall ldmd.log | grep -E "Connection from" | wc
> >      44     352    3564
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
> >      18     144    1170
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
> >      18     108     846
> >
> > The stats for windfall and bob are for comparison. You'll see that for those two hosts the number of connects and exits are the same. I'd expect them to be the same plus or minus one.
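The same connection-versus-exit comparison can be scripted for all of the hosts at once instead of running one grep pair per host. Below is a rough sh sketch, assuming the "Connection from"/"Exiting" wording shown in the log excerpts above; the log path follows the ~/logs directory used in the session, and the host list is just the three hosts from this exchange and would need to be adjusted for another site:

    #!/bin/sh
    # For each host, count "Connection from" and "Exiting" entries in the LDM
    # log.  A host whose connection count keeps pulling ahead of its exit
    # count (as sysu1 does above) is a likely source of leftover rpc.ldmd
    # processes.  The log path and host list are assumptions for this sketch.
    LOG=~ldm/logs/ldmd.log
    for host in sysu1 windfall bob; do
        conns=`grep "$host" $LOG | grep -c "Connection from"`
        exits=`grep "$host" $LOG | grep -c "Exiting"`
        echo "$host: $conns connections, $exits exits"
    done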
> > I know WSI has their own unique version of the LDM, based on a very early version. As an experiment, are you able to do without this data for a bit? I will conduct my own experiment here on our 7.1 machine, but it may take me a little time, as I have to be away for a few hours starting soon.

> I was starting to suspect the wsi feed as well. I know they've had a lot of trouble staying connected here, and when I saw all the rpc's, I started thinking about wsi trying to connect. If their software is old, they should probably update, since their feed has not been as reliable as it used to be, unless it's just a network bandwidth problem. I guess I would have to take that up with them. But the connect/reconnect thing shouldn't haul my system down either, should it?

> > One other point. With your 2GB queue, you have lots of data. At the time I ran the pqmon command below, you had over 10 hours' worth of data, and it was growing (see the 'age' field - it gives the age, in seconds, of the oldest product in the queue). The number of products in the queue is also going up, so space is not yet being recycled:
> >
> > [ldm@ldm ~/data]$ pqmon -i3
> > May 31 17:49:25 pqmon: Starting Up (17268)
> > May 31 17:49:26 pqmon: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
> > May 31 17:49:26 pqmon: 185848 64 326088 2051968120 185848 471 326088 1549296 36372
> > May 31 17:49:29 pqmon: 185868 64 326068 2052103712 185868 471 326068 1549296 36376
> > May 31 17:49:33 pqmon: 185877 64 326059 2052139000 185877 471 326059 1549296 36379
> > May 31 17:49:36 pqmon: 185890 64 326046 2052203688 185890 471 326046 1549296 36382
> > May 31 17:49:39 pqmon: 185901 64 326035 2052238392 185901 471 326035 1549296 36386
> > May 31 17:49:42 pqmon: 185916 64 326020 2052322080 185916 471 326020 1549296 36389
> > May 31 17:49:46 pqmon: 185923 63 326014 2052353264 185923 471 326014 1549296 36392
> > May 31 17:49:49 pqmon: 185938 63 325999 2052437608 185938 471 325999 1549296 36395
> > May 31 17:49:52 pqmon: 185944 63 325993 2052463160 185944 471 325993 1549296 36398
> > May 31 17:49:55 pqmon: 185947 63 325990 2052480008 185947 471 325990 1549296 36402
> > May 31 17:49:59 pqmon: 185952 63 325985 2052525544 185952 471 325985 1549296 36405
> > May 31 17:50:03 pqmon: 185959 63 325978 2052588304 185959 471 325978 1549296 36409
> > May 31 17:50:06 pqmon: 185967 62 325971 2052651936 185967 471 325971 1549296 36412
> > May 31 17:50:09 pqmon: 185977 62 325961 2052717376 185977 471 325961 1549296 36416
> > May 31 17:50:12 pqmon: 185988 62 325950 2052812104 185988 471 325950 1549296 36419
> > May 31 17:50:16 pqmon: 185992 62 325946 2052852920 185992 471 325946 1549296 36422
> > May 31 17:50:19 pqmon: 186002 62 325936 2052912024 186002 471 325936 1549296 36425
> > May 31 17:50:22 pqmon: 186013 62 325925 2053009880 186013 471 325925 1549296 36428
> > May 31 17:50:25 pqmon: 186018 61 325921 2053029616 186018 471 325921 1549296 36432
> > May 31 17:50:29 pqmon: 186031 61 325908 2053061800 186031 471 325908 1549296 36435
> > May 31 17:50:32 pqmon: 186039 61 325900 2053099008 186039 471 325900 1549296 36439
> > May 31 17:50:35 pqmon: 186048 61 325891 2053150176 186048 471 325891 1549296 36442
> > May 31 17:50:39 pqmon: 186059 61 325880 2053246544 186059 471 325880 1549296 36445
> > May 31 17:50:42 pqmon: 186070 61 325869 2053333296 186070 471 325869 1549296 36448
> > May 31 17:50:45 pqmon: 186081 61 325858 2053422336 186081 471 325858 1549296 36452
> > May 31 17:50:49 pqmon: 186095 61 325844 2053506456 186095 471 325844 1549296 36455
> > May 31 17:50:52 pqmon: 186103 61 325836 2053532408 186103 471 325836 1549296 36459
> > May 31 17:50:56 pqmon: 186112 61 325827 2053643864 186112 471 325827 1549296 36462
> > May 31 17:50:59 pqmon: 186118 61 325821 2053755592 186118 471 325821 1549296 36465
> > May 31 17:51:02 pqmon: 186124 61 325815 2053858840 186124 471 325815 1549296 36469
> > May 31 17:51:05 pqmon: 186128 61 325811 2053906992 186128 471 325811 1549296 36472
> > May 31 17:51:09 pqmon: 186139 61 325800 2054017464 186139 471 325800 1549296 36475
> > May 31 17:51:12 pqmon: 186148 61 325791 2054157200 186148 471 325791 1549296 36478
> > May 31 17:51:15 pqmon: 186155 61 325784 2054262720 186155 471 325784 1549296 36481
> > May 31 17:51:19 pqmon: 186162 60 325778 2054333056 186162 471 325778 1549296 36485
> > May 31 17:51:22 pqmon: 186172 60 325768 2054454576 186172 471 325768 1549296 36488
> > May 31 17:51:26 pqmon: 186176 60 325764 2054533992 186176 471 325764 1549296 36492
> > May 31 17:51:29 pqmon: 186185 60 325755 2054675840 186185 471 325755 1549296 36495
> > May 31 17:51:32 pqmon: 186190 60 325750 2054758024 186190 471 325750 1549296 36498
> > May 31 17:51:35 pqmon: 186197 59 325744 2054844960 186197 471 325744 1549296 36501
> > May 31 17:51:36 pqmon: Interrupt
> > May 31 17:51:36 pqmon: Exiting
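A related check, if one wants to watch whether the queue ever starts recycling, is to keep sampling pqmon and track only the age column. A rough sh sketch, assuming the field layout in the pqmon output above (the age of the oldest product, in seconds, is the last field of each report line) and using the same -i interval flag shown there, one report per minute:

    #!/bin/sh
    # Print just the time of each pqmon report and the age of the oldest
    # product in the queue.  While the age keeps climbing, the queue is still
    # filling; once it levels off, old products are being deleted and the
    # space is being recycled.
    pqmon -i60 2>&1 | awk '$4 == "pqmon:" && NF == 13 && $NF ~ /^[0-9]+$/ { print $3, "oldest product age:", $NF, "sec" }'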
> > Do you really want to keep that much data? If you have the space and everything's working fine, I guess there's no reason not to... This is just an FYI.

> Yeh, I know it will hold a lot, but I like lots of data :) As I said, if I could make the queue even bigger, I would. Space is cheap these days, and I figure that as a relay, if someone downstream is down for a bunch of hours, they can still catch up on the data.

> > Please let me know what you think about the WSI feed. I will be leaving here in about 15 minutes, but will give my own test a try later this afternoon when I return.

> Maybe I'll try killing off some rpc.ldmd processes and see if things improve, assuming I don't jam the system.

Okay, I killed off most or all of the hung rpc's... not sure if it helped much, though. Disk may be a little less busy, but it's still pretty busy. Maybe this is still some sort of a queue issue...?

Art.

Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: address@hidden, phone: 814-863-1563
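After killing off leftover rpc.ldmd processes, a quick sanity check is to compare the running count against the rule of thumb quoted earlier in the thread (requests plus allows plus one). A rough sh sketch; it counts request and allow lines in ldmd.conf rather than "unique" hosts, so it overestimates when several lines name the same host, and the paths follow the ~ldm layout used in the session above:

    #!/bin/sh
    # Estimate how many rpc.ldmd processes there should be (request lines +
    # allow lines + 1) and compare with how many are actually running.
    # Because duplicate hosts are not collapsed, the estimate is an upper
    # bound on the expected count.
    CONF=~ldm/etc/ldmd.conf
    requests=`grep -c "^request" $CONF`
    allows=`grep -c "^allow" $CONF`
    expected=`expr $requests + $allows + 1`
    actual=`ps -ef | grep rpc.ldmd | grep -v grep | wc -l`
    echo "expected at most about $expected rpc.ldmd processes, found $actual"

If the running count climbs well past that estimate again, the leftover-connection problem (here, apparently the WSI feed) is still active.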