This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741

Tom,

On 30 Oct, you wrote:

> I'm getting a whole bunch of messages about 'pq_del_oldest: conflict' and
> 'comings: pqe_new: Resource temporarily unavailable' followed by
> disconnects from 'snow' which is the Cornell ldm server (see below for
> samples).  I'm not sure if this indicates a problem with the queue on our
> host or theirs.  I'm running ldm 5.1.2 on a sparc solaris 2.7 host.

and Anne replied:

> > We're developing a theory that we will be testing this afternoon.
> ........
> > I'll let you know the results of our testing later today.
>
> I'd be interested in hearing the results.

This turned out not to be a problem with either the upstream or
downstream queue, but a symptom that appears when pqact cannot keep up
with the product streams.

You can determine how far behind pqact is currently running as follows
(a command-line sketch of these steps appears after the list):

- Run "ldmadmin watch" or "pqutil -w" for a while, capturing the output
  in a file; this tells you when products are being inserted into the
  product queue.  Here is an example of how to do this with pqutil in
  csh syntax:

    % pqutil -w ~ldm/data/ldm.pq < /dev/null >& /tmp/pqutil.log &

- Turn on verbose logging for pqact by finding its process ID and
  sending it a USR2 signal with "kill".

- Find a product ID that appeared in the "ldmadmin watch" output and
  find the same product in the pqact verbose logging output.  Note that
  if pqact has fallen way behind, the same product ID may not appear in
  the pqact output for several hours, but each product should appear in
  exactly the same order as in the ldmadmin output, even if you don't
  do anything with the product in pqact.conf.

- Take the difference between the time stamps for the product in the
  pqact log and the "ldmadmin watch" log; that's how far pqact is
  running behind.  If it's close to the age of the oldest product in
  the queue, as shown by the last column in the output of "pqmon", then
  you will be getting the 'pq_del_oldest: conflict' messages.

- Remember to turn off verbose logging in pqact by sending it another
  couple of USR2 signals, to cycle it through debug logging and back to
  the default silent logging.
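Putting those steps together, a session might look something like the
following (csh syntax; the process ID "12345", the product ID
"SAUS42 KXYZ", and the pqact log location are only placeholders for
illustration, since pqact's log destination depends on how your syslog
is configured):

    % ps -ef | grep pqact                     # find pqact's process ID
    % kill -USR2 12345                        # first USR2: verbose logging
      (let both logs accumulate for a while)
    % grep "SAUS42 KXYZ" /tmp/pqutil.log      # when the product was inserted
    % grep "SAUS42 KXYZ" ~ldm/logs/ldmd.log   # when pqact got to the product
    % pqmon                                   # last column: age of oldest product
    % kill -USR2 12345                        # second USR2: debug logging
    % kill -USR2 12345                        # third USR2: back to default logging

The difference between the two grep time stamps is how far pqact is
running behind; compare that to the oldest-product age reported by
pqmon.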
Several problems can cause pqact to fall behind, including:

- Requesting pqact to write a lot of files.  For example, when we turn
  on filing all WSI products, each in a separate file using the "FILE
  -close" action, eventually pqact falls further and further behind
  with all the I/O activity, and when it finally acts on the oldest
  product in the queue at the same time that space is needed for new
  products, the 'pq_del_oldest: conflict' message results.  Note that
  when pqact writes a file in a deeply nested directory structure, it
  has to open and read each directory in the path to that file, so
  there is a lot more I/O going on than just writing a single file.  We
  have seen bursts of as many as 200 products per second, though the
  average we're handling is only about 16 products per second.

- Running pqact on a system with slow disks, or writing lots of files
  to remotely mounted file systems.  This appears to have been our
  problem, since we only started seeing the pq_del_oldest conflicts
  after our RAID disks went out and we temporarily replaced them with
  slower disks, then started filing all the WSI products.

- Running pqact on an overloaded or under-configured system.  You can
  tell whether this is happening with the normal commands for
  monitoring system load.  pqact can invoke lots of decoders in
  parallel, so a tool like "top" can show you what kind of load this is
  putting on the system.  Some decoders take lots of memory, so it's
  possible that virtual-memory thrashing could result from running lots
  of decoders on a system without much memory.

- Having too many pqact rules in the pqact.conf file.  Each product ID
  is matched sequentially against every pattern in the pqact.conf file
  before the next product is retrieved from the queue, so having
  hundreds or thousands of pattern-action entries could slow down the
  processing of each product enough to make pqact fall behind the data
  feeds.  If this turns out to be a problem, we can provide a fix by
  adding some extra action syntax that means "this is the last thing I
  want to do with this product, so skip all the remaining
  pattern-action rules and start over with the next product".

- Having too few open-file and running-decoder slots for pqact.  To
  avoid opening and closing lots of files and starting up a new decoder
  for each product, pqact keeps a stack of open file descriptors for
  the most recently used files and pipes to decoders.  The default
  number of these is MAXENTRIES, a macro defined in pqact/filel.c and
  currently set to 32.  You may need to increase this and recompile if
  pqact appends to lots of files and sends data to lots of different
  decoders (see the example below).

- Having a product queue that's too small, so there is not enough
  elasticity for pqact when it has too much to do.

I'd be interested to hear whether you think any of these apply in your
situation.  I don't think there's much that can be done to make pqact
more efficient currently, since it uses fast algorithms for pattern
matching and tries to use file descriptors efficiently to avoid
unnecessary process creation and file opening/closing.  But if you
profile pqact and find a bottleneck that's slowing it down, we'd be
interested in hearing about it.

--Russ

_____________________________________________________________________
Russ Rew                                        UCAR Unidata Program
address@hidden                       http://www.unidata.ucar.edu
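As a follow-up to the MAXENTRIES item above, here is one way to check
the current limit before deciding whether to raise it and recompile.
The source path and the exact form of the #define line are only
assumptions for illustration (use wherever your LDM sources are
unpacked); the default value of 32 is the one mentioned above:

    % grep MAXENTRIES ~ldm/ldm-5.1.2/src/pqact/filel.c
    #define MAXENTRIES 32

If pqact routinely appends to more files, or pipes to more decoders,
than that at one time, edit the value upward (to 64, say), then rebuild
and reinstall pqact.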