[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: 20001030: More on queue problem
- Subject: Re: 20001030: More on queue problem
- Date: Thu, 02 Nov 2000 09:03:04 -0700
>To: address@hidden
>From: Tom McDermott <address@hidden>
>Subject: LDM: queue problem
>Organization: UCAR/Unidata
>Keywords: 200010301740.e9UHeN411741
Tom,
On 30 Oct, you wrote:
> I'm getting a whole bunch of messages about 'pq_del_oldest: conflict' and
> 'comings: pqe_new: Resource temporarily unavailable' followed by
> disconnects from 'snow' which is the Cornell ldm server (see below for
> samples). I'm not sure if this indicates a problem with the queue on our
> host or theirs. I'm running ldm 5.1.2 on a sparc solaris 2.7 host.
and Anne replied:
> > We're developing a theory that we will be testing this afternoon.
> ........
> > I'll let you know the results of our testing later today.
>
> I'd be interested in hearing the results.
This turned out not to be a problem with either the upstream or
downstream queue, but a symptom that appears when pqact cannot keep up
with the product streams.
You can determine how far behind pqact is currently running by:
- Run "ldmadmin watch" or "pqutil -w" for a while, capturing the
output in a file (this tells when products are getting inserted
into the product queue). Here is an example of how to do this with
pqutil in csh syntax:
% pqutil -w ~ldm/data/ldm.pq < /dev/null > & /tmp/pqutil.log &
- Turn on verbose logging for pqact by finding its process ID and
sending it a USR2 signal with "kill".
- Find a product ID that appeared in the "ldmadmin watch" output and
find the same product in the "pqact" verbose logging output. Note
that if pqact has fallen way behind, the same product ID may not
appear for several hours in the pqact output, but each product
should appear in the exact same order as in the ldmadmin output,
even if you don't do anything with the product in pqact.conf.
- Take the difference between the time stamps for the product in the
pqact log and the "ldmadmin watch" log; that's how far pqact is
running behind. If it's close to the age of the oldest product in
the queue, as shown by the last column in the output of "pqmon",
then you will be getting the "pq_del_oldest conflict:" messages.
- Remember to turn off verbose logging in pqact by sending it another
couple of USR2 signals to cycle into debug logging and then default
silent logging.
Several problems can cause pqact falling behind, including:
- Requesting pqact to write a lot of files. For example, when we
turn on filing all WSI products, each in a separate file using the
"FILE -close" action, eventually pqact falls further and further
behind with all the I/O activity and when it finally is acting on
the oldest product in the queue at the same time that space is
needed for new products, the 'pq_del_oldest: conflict' message
results. Note that when pqact writes a file in a deeply nested
directory structure, it has to open and read each directory in the
path to that file, so there's lots more I/O going on than just
writing a single file. We have seen clusters of as many as 200
products per second, though the average we're handling is only
about 16 products per second.
- Running pqact on a system with slow disks, or writing lots of files
on remotely mounted systems. This appears to have been our
problem, since we only started seeing the pq_del_oldest conflicts
after our RAID disks went out and we temporarily replaced them with
slower disks, then started filing all the WSI products.
- Running pqact on an overloaded or under-configured system. You can
tell if this is happening with the normal commands for monitoring
system load. pqact can invoke lots of decoders in parallel, so a
tool like "top" can let you see what kind of load this is putting
on the system. Some decoders take lots of memory, so its possible
that virtual memory thrashing could result from lots of decoders on
a system without much memory.
- Having too many pqact rules in the pqact.conf file. Each product
ID is sequentially matched against every pattern in the pqact.conf
file before the next product is retrieved from the queue, so having
hundreds or thousands of pattern-action entries could slow down the
processing of each product enough to make pqact fall behind the
data feeds. If this turns out to be a problem, we can provide a
fix by adding some extra action syntax that means "this is the last
thing I want to do with this product so skip all the remaining
patterns-action rules and start over with the next product".
- Having not enough open-file and running-decoder slots for pqact.
To avoid opening and closing lots of files and starting up new
decoders for each product, pqact has a stack of open file
descriptors for the most recently used files and pipes to decoders.
The default number of these is MAXENTRIES, a macro defined in
pqact/filel.c and currently set to 32. You may need to increase
this and recompile, if pqact appends to lots of files and sends
data to lots of different decoders.
- Having a product queue that's too small, so there is insufficient
elasticity for pqact when it has too much to do
I'd be interested if you think any of these apply in your situation.
I don't think there's much that can be done to make pqact more
efficient currently, since it uses fast algorithms for pattern
matching and tries to use file descriptors efficiently to avoid
unnecessary process creation or file opening/closing. But if you
profile pqact and find a bottleneck that's slowing it down, we'd be
interested in hearing about it.
--Russ
_____________________________________________________________________
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu