This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
I'm just forwarding this in case you're interested.  My earlier reply
indicated we thought we had figured out what was going on, but it looks
like the cause of the similar problem Tom McDermott is seeing is still a
mystery ...

--Russ

>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741

------- Forwarded Message

Date: Thu, 02 Nov 2000 13:03:27 -0500
From: Tom McDermott <address@hidden>
To: Russ Rew <address@hidden>
Subject: Re: 20001030: More on queue problem

Russ,

Thanks a lot for the detailed response.  My comments follow below.

On Thu, 2 Nov 2000, Russ Rew wrote:

> This turned out not to be a problem with either the upstream or
> downstream queue, but a symptom that appears when pqact cannot keep up
> with the product streams.

Because pqact is holding locks on the portion of the queue it hasn't
processed yet, and so the ingester processes are unable to delete the
oldest entries to make room for new ones?  The reason I mentioned the
upstream queue (Cornell) in my original message was that I had only seen
these messages previously in Cornell's ldmd log web page, never on our
system.

> You can determine how far behind pqact is currently running by:

....... <instructions deleted> ......

> - Take the difference between the time stamps for the product in the
>   pqact log and the "ldmadmin watch" log; that's how far pqact is
>   running behind.  If it's close to the age of the oldest product in
>   the queue, as shown by the last column in the output of "pqmon",
>   then you will be getting the "pq_del_oldest conflict:" messages.

I tried this out just now while the 12Z AVN is coming in: the time stamps
are identical.  So if a pqact slowdown really was the cause, it's not
happening right now.
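A rough sketch of the lag check described above, assuming the LDM
utilities are on the PATH; the log file path shown is illustrative only
and will differ between installations:

    # Age, in seconds, of the oldest product in the queue; per the note
    # above, it is the last column of pqmon's report.
    pqmon

    # In another window, watch products as they are inserted into the
    # queue, and note their time stamps.
    ldmadmin watch

    # Compare the time stamp of the product pqact most recently acted on
    # (from the pqact entries in the LDM log, wherever it lives on your
    # system) with the insertion time stamps shown by "ldmadmin watch".
    # When that lag approaches the oldest-product age reported by pqmon,
    # the "pq_del_oldest: conflict" messages can be expected.
    tail -f ~/logs/ldmd.log | grep pqact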
> Several problems can cause pqact to fall behind, including:
>
> - Requesting pqact to write a lot of files.  For example, when we
>   turn on filing all WSI products, each in a separate file using the
>   "FILE -close" action, eventually pqact falls further and further
>   behind with all the I/O activity, and when it finally is acting on
>   the oldest product in the queue at the same time that space is
>   needed for new products, the 'pq_del_oldest: conflict' message
>   results.  Note that when pqact writes a file in a deeply nested
>   directory structure, it has to open and read each directory in the
>   path to that file, so there's lots more I/O going on than just
>   writing a single file.  We have seen clusters of as many as 200
>   products per second, though the average we're handling is only
>   about 16 products per second.

Possibly, but not likely.  I haven't changed any feed requests or
pattern-actions recently except to delete those dealing with DIFAX.

> - Running pqact on a system with slow disks, or writing lots of files
>   on remotely mounted systems.  This appears to have been our
>   problem, since we only started seeing the pq_del_oldest conflicts
>   after our RAID disks went out and we temporarily replaced them with
>   slower disks, then started filing all the WSI products.

Unlikely.  I covered this point in my email to Anne on 10/31, which you
should have received a copy of.

> - Running pqact on an overloaded or under-configured system.  You can
>   tell if this is happening with the normal commands for monitoring
>   system load.  pqact can invoke lots of decoders in parallel, so a
>   tool like "top" can let you see what kind of load this is putting
>   on the system.  Some decoders take lots of memory, so it's possible
>   that virtual memory thrashing could result from lots of decoders on
>   a system without much memory.

I don't think this describes our situation.  Just now:

last pid: 15681;  load averages: 0.38, 0.36, 0.51            11:55:24
119 processes: 118 sleeping, 1 on cpu
CPU states: 79.0% idle, 7.6% user, 11.2% kernel, 2.2% iowait, 0.0% swap
Memory: 512M real, 17M free, 124M swap in use, 653M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE     TIME    CPU COMMAND
29919 ldm        1  55    0  148M  123M sleep   326:02  1.74% pqact
29918 ldm        1  48    0  147M  108M sleep   211:59  1.11% pqbinstats
15676 ldm        1  56    0 1448K 1196K cpu/2     0:00  1.10% top
12889 ldm        1  58    0 5552K 2292K sleep    94:17  0.91% dchrly
14338 ldm        1  33    0 7316K 2136K sleep     4:20  0.71% dcgrib
29923 ldm        1  58    0   11M 4120K sleep   109:23  0.68% pqact
29920 ldm        1  58    0  155M  110M sleep    69:40  0.44% pqsurf
29925 ldm        1  58    0  147M  115M sleep    48:16  0.30% rpc.ldmd
29935 ldm        1  58    0  147M  113M sleep    63:39  0.25% rpc.ldmd
  205 root       4  38    0 2224K  880K sleep    37:34  0.23% ypserv
29924 ldm        1  58    0  147M   89M sleep    56:36  0.22% rpc.ldmd
  479 root       1  58    0 1984K  508K sleep    22:01  0.15% fingerd
15583 ldm        1  58    0 1068K  888K sleep     0:00  0.13% mailx
15677 ldm        1  52    2 1660K 1100K sleep     0:00  0.09% ping
  311 root      21  54    0 3948K 3220K sleep    11:29  0.08% nscd

It is frequently the case that there are multiple decoders executing
simultaneously (e.g., when a number of different models are coming in at
the same time), but not enough to even come close to overloading the
system.  I've tried to write the pattern-actions to reduce the number of
simultaneous invocations of decoders to the greatest extent possible.
BTW, getting rid of pqexpire with 5.1.2 reduced system load considerably.

> - Having too many pqact rules in the pqact.conf file.

I have 393 pattern-actions currently in pqact.conf.  I don't think this
is enough to cause a problem, and anyway this number hasn't changed
appreciably recently.

> - Having not enough open-file and running-decoder slots for pqact.
>   To avoid opening and closing lots of files and starting up new
>   decoders for each product, pqact has a stack of open file
>   descriptors for the most recently used files and pipes to decoders.
>   The default number of these is MAXENTRIES, a macro defined in
>   pqact/filel.c and currently set to 32.  You may need to increase
>   this and recompile, if pqact appends to lots of files and sends
>   data to lots of different decoders.

Possibly, since pqact does append to lots of files and sends data to lots
of different decoders, but I imagine it does at most sites.  Is there a
way to detect if there is a shortage of file descriptors for a given
process?

> - Having a product queue that's too small, so there is insufficient
>   elasticity for pqact when it has too much to do.

I don't think so, although I did reduce the queue size recently from 180
to 150 MB, since pqmon was reporting ages of up to 21,000, which I
thought was excessive.  I checked it while running the pqutil/pqact
experiment and it was 5905, which is the lowest I've ever seen it, but I
would think it's still more than enough.  But I suspect this was the
culprit in Cornell's case, since their queue is too small.  They're still
running ldm 5.0.x and they get tons of 'Deleting oldest to make space'
messages in their logs when the models come in.

> I'd be interested if you think any of these apply in your situation.
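On the file-descriptor question above: the MAXENTRIES limit is internal
to pqact, but the number of descriptors the process actually holds open
can be checked from the operating system.  A rough sketch for a
Solaris-style system like the one in the top listing (the pid is
illustrative; substitute pqact's current pid):

    # pqact's process id, taken from the top listing above.
    PQACT_PID=29919

    # Count the descriptors the process currently has open (/proc works
    # on both Solaris and Linux).
    ls /proc/$PQACT_PID/fd | wc -l

    # On Solaris, pfiles lists each open descriptor in detail.
    pfiles $PQACT_PID

    # For comparison, the soft per-process descriptor limit of the shell
    # that starts the LDM, which pqact normally inherits.
    ulimit -n

This only shows the OS-level picture; whether pqact's internal MAXENTRIES
stack is the bottleneck cannot be read directly from it.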
> I don't think there's much that can be done to make pqact more
> efficient currently, since it uses fast algorithms for pattern
> matching and tries to use file descriptors efficiently to avoid
> unnecessary process creation or file opening/closing.  But if you
> profile pqact and find a bottleneck that's slowing it down, we'd be
> interested in hearing about it.

No, I don't think so either; pqact seems to be very efficient.  Really,
the only thing that keeps coming to mind is the coincidence of installing
those DIFAX conversion scripts on Thursday afternoon and then the first
instance popping up on Friday morning.  It's hard for me to believe they
could have bogged down the system enough to slow down pqact, but maybe
they did.  I have subsequently moved the scripts to another host, so we
can eliminate this as a possibility in any future incident.
Unfortunately, I was not on the premises when these two incidents
occurred, but if it happens again and I am around, I can look at some of
the things you mention in your message and then get back to you with
something more concrete.

Thanks again.

Tom

------------------------------------------------------------------------------
Tom McDermott                                 Email: address@hidden
System Administrator                          Phone: (716) 395-5718
Earth Sciences Dept.                          Fax: (716) 395-2416
SUNY College at Brockport

------- End of Forwarded Message
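For anyone who wants to take up the profiling suggestion mentioned above,
a rough sketch of one conventional approach using gprof follows.  It
assumes a build of the LDM from source with its configure script; the
flags and the binary path are illustrative guesses, not LDM-specific
instructions.

    # Rebuild with profiling instrumentation (illustrative flags).
    CFLAGS="-pg -g" ./configure
    make

    # Run the instrumented pqact under a representative product load,
    # then let it exit normally so it writes gmon.out in its working
    # directory; a process killed by an uncaught signal will not.

    # Examine the flat profile and call graph (the binary path is a guess
    # at where the build leaves pqact).
    gprof pqact/pqact gmon.out | more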