[LDM #BXE-682607]: What is the point of a large LDM Queue Size?
- Subject: [LDM #BXE-682607]: What is the point of a large LDM Queue Size?
- Date: Fri, 26 Jul 2024 12:05:01 -0500
I am sure you are not the only one who has wrestled with this, so please
forgive my rather verbose reply; I am also going to cut and paste my reply to
your question to the ldm-users list to provide guidance on queue size.
So, please don't take the following as patronizing - I am merely going to cover
all the bases, as I would rather get the content out to everyone.
One of the powers of *nix operating systems is the ability to performance-tune
the OS and applications for a specific use. This also applies to LDM.
The purpose of the LDM queue is to act as a network/processing "shock absorber"
that helps mitigate bottlenecks; the concept of streaming directly from node to
node with no buffering would be great in a perfect world, but we don't live in
a perfect world.
There are three conditions for which a large queue size is important (although
the first two are related - i.e. what is done after data is received):
1) the node in question is feeding downstream nodes via REQUESTS;
2) the node in question is performing a lot of post-reception processing
(pqact) that can challenge the OS kernel scheduler to keep up;
3) the node in question is a "top-tier" node for injecting data onto the local
LDM queue - i.e., pqinsert or SBN (NOAAPort/NWWS) reception.
Taking the first listed instance into account: if your node is a terminal node
- i.e., you are not feeding any downstream nodes - and processing is minimal
(writing to a file system, for instance), there really is not a need for a
large queue. The calculation for the queue size mostly has to account for a
full scan's worth of storage slots of the largest product you may expect, such
as the netCDF GOES data tiles. A 500M queue would be plenty, and light and
nimble.
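To put rough numbers on that back-of-the-envelope calculation - the figures
below are made up for illustration, and the regutil registry path may differ
slightly across LDM versions:

    # hypothetical sizing for a terminal node:
    #   largest expected product  ~ 20 MB (e.g., a netCDF GOES tile)
    #   products per scan         ~ 20
    #   20 MB x 20 = 400 MB  ->  round up to 500M
    regutil -s 500M /queue/size              # record the size in registry.xml
    ldmadmin delqueue && ldmadmin mkqueue    # rebuild the queue at the new size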
Conversely, if your node is a relay *or* "top-tier" node where products are
injected via pqinsert or noaaportIngester (addressing the first and third
listed instances), and you are feeding downstream LDM nodes, the desire may be
to build the largest queue possible given the physical memory constraints of
the machine or VM instance. The purpose of such a large queue is to build in
as much resilience to a network outage or bottleneck as possible. As with
every condition I will discuss here, there is always an exception . . . if you
are using your LDM only to receive NOAAPort - you have as many
noaaportIngester processes running as you desire, and you direct the output
from pqact straight to a file system (see the sketch below) - then you don't
need a lot of queue for that, either. I would suggest more than 500M, but not
much, unless you are on a 666MHz Pentium III ;-)
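As a sketch of that minimal NOAAPort-only setup, a pqact entry that just files
products away might look like the following - the feedtype, pattern, and path
are examples only, and remember that the fields in the real file must be
tab-separated:

    # pqact.conf: write each product to disk and close the file immediately
    # (feedtype/pattern/path are hypothetical; separate fields with tabs)
    HDS    ^(.*)
        FILE    -close    /data/ldm/raw/\1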
Finally, the most complicated scenario of the three listed is the second one;
you will need to pull on your big-person system administration pants and do
some math and benchmarking. The reason this one is more complicated is that
you have to weigh the need for the queue to be large enough to handle all the
pqact-enabled or piped processing, yet small enough to leave resources for any
other LDM-external applications or services that are fed or piped the data -
such as an RDBMS or a graphics-creation task.
External to LDM, tools such as top, w, and mpstat can help you track the
memory and processing usage of an RDBMS engine, or of other post-processing
that is not piped directly via pqact. incrontab, for instance, watches for
files being closed in a directory and then kicks off a job, and may not be
directly related to an LDM session.
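As an example of that pattern, an incrontab entry of the following form (the
directory and script are hypothetical) fires a job every time a file is closed
after writing in the watched directory:

    # incrontab -e
    # <watched directory>  <event mask>    <command>
    /data/ldm/raw          IN_CLOSE_WRITE  /usr/local/bin/post_process.sh $@/$#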
Internal to LDM, one can encapsulate a PIPE instance in the pqact
configuration file in a script, and fire the actual executable using the
"time" built-in to measure the system and real-time processing statistics,
then average those measurements against the output from pqmon to determine the
minimum queue size. Your goal is to have a queue large enough to handle the
potential bottleneck of processing, yet leave resources for other, non-LDM
processing.
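Here is a minimal sketch of that encapsulation, assuming a hypothetical
decoder at /usr/local/bin/my_decoder and a wrapper installed where pqact can
find it:

    #!/bin/bash
    # timed_wrapper.sh - invoked by a pqact PIPE action; the product
    # arrives on stdin and is passed straight through to the decoder.
    # The bash "time" built-in reports real/user/sys times on stderr,
    # which we append to a log for later averaging.
    { time /usr/local/bin/my_decoder "$@" ; } \
        2>> /usr/local/ldm/logs/decoder-timing.log

The matching pqact.conf entry (the pattern is an example only, and fields are
tab-separated) would be something like:

    IDS|DDPLUS    ^SAUS.*
        PIPE    /usr/local/ldm/bin/timed_wrapper.sh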
Lastly, the issue of physical memory and LDM. I don't know anyone who compiles
LDM without mmap(), so it is worth mentioning some basics about the queue and
memory usage. In almost every case of modern LDM loads that I know of, the
size of the queue is limited by the physical memory of the hardware/VM
instance, because the queue is mapped to memory for speed and rapid access. If
you have set the queue to 32768M in registry.xml, you need to ensure that you
have more than that in physical/VM-allocated memory, or you will set yourself
up for swapping (if enabled) or run-time faults. And this only applies if the
queue is actually configured "on-disk". If you use ramdisk/shm as your queue
partition, then you will need more than twice the queue size indicated in
registry.xml. One final thing to consider: the larger the queue, the more time
it takes to create and manage at the system level - a simple function of the
number of bytes.
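A quick sanity check along those lines, using standard Linux tools:

    regutil /queue/size    # what registry.xml says the queue size should be
    free -g                # physical/VM memory actually available
    # on-disk queue:     total memory should comfortably exceed the queue size
    # ramdisk/shm queue: total memory should exceed twice the queue size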
The point of this is to illustrate that simply configuring for the maximum
queue size possible is not necessarily a good practice. If you are running a
pure "relay" node - you have a REQUEST to an upstream node, and you are
passing data to downstream nodes - then, sure. If you are not running a
graphical console or much of anything else, I would be comfortable with an
instance having 48G of physical/VM-allocated memory making the queue 40G if
the queue is on disk, or 20G if the queue is in ramdisk/shm (which is done for
speed, by the way).
If you are inserting a satellite-sourced stream, such as NOAAPort, I would
suggest that the same queue requires a minimum of 64G of physical/VM-allocated
memory, mostly because such a machine is normally a "relay" - except that
instead of an upstream REQUEST, you are performing inserts. The additional
physical memory in this instance is to accommodate the local buffering of the
noaaportIngester processing.
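For reference, such a node typically starts one ingester per NOAAPort channel
from ldmd.conf; the multicast group below is an example only, and the option
spelling may vary by LDM version:

    # ldmd.conf: one noaaportIngester per NOAAPort channel (example group)
    EXEC    "noaaportIngester -m 224.0.1.1"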
In the instance of doing a lot of tasks while receiving data, pqmon will tell
you the age of the oldest product in the queue, and you can measure that
against the stats you acquire through system tools such as top, w, or mpstat.
You want to avoid "strangling" your machine by not leaving enough resources
external to LDM to do what you want to do.
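Concretely, that means watching the age column pqmon reports - the number of
seconds since the oldest product in the queue was inserted - while your system
tools run alongside; the exact output layout varies a bit by LDM version:

    pqmon -i 5    # print queue statistics every 5 seconds
    # if the reported age keeps shrinking toward zero under load, products
    # are being overwritten before processing finishes - the queue is too small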
One last thing I want to point out, as I know it is popular with some to put
the queue on ramdisk/shm . . . although fast, it is not resilient. Unlike a
queue on a regular file system, once you reboot, that queue is gone. If you
want to keep some semblance of a queue, and you do "nice" LDM shutdowns before
rebooting, then when the machine comes back up, you will have your original
queue ready to roll. If you go the memory route, you will need to "mkqueue"
upon every reboot. I am not saying one method is better than the other; I
merely want to point out the caveat of using the faster "disk space in memory"
paradigm.
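If you do go the shm route, the recreate step is easy to automate at boot; a
minimal sketch, run as the ldm user from something like rc.local or a systemd
unit:

    # recreate the queue on the tmpfs partition, then start LDM
    su - ldm -c 'ldmadmin mkqueue && ldmadmin start'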
-
Stonie Cooper, PhD
Software Engineer III
NSF Unidata Program Center
University Corporation for Atmospheric Research
I acknowledge that the land I live and work on is the traditional territory of
The Pawnee, The Omaha, and The Otoe.
Ticket Details
===================
Ticket ID: BXE-682607
Department: Support LDM
Priority: Normal
Status: Open
===================
NOTE: All email exchanges with NSF Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available through the
web. If you do not want to have your interactions made available in this way,
you must let us know in each email you send to us.