- Subject: [LDM #WSJ-190258]: queue size question
- Date: Mon, 25 Mar 2019 11:27:26 -0600
John,
> Some portion of the LDM radar and satellite image creation failed again
> overnight and I'm not sure what the issue is. The "pqmon" shows the max age
> at 11004 so that looks better now.
>
> 20190325T150821.335052Z pqmon[13846] NOTE pqmon.c:358:main() nprods nfree nempty nbytes maxprods maxfree minempty maxext age
> 20190325T150821.335079Z pqmon[13846] NOTE pqmon.c:466:main() 1351711 3 0 123399368448 1351713 6 0 22486561072 11004
Wow! 123 gigabytes! You took what I said and ran with it! :-)
You should be OK -- although with only 24 GB of memory your system will be
swapping portions of the product-queue in and out continuously. I recommend
monitoring the LDM system via the "ldmadmin addmetrics" and "ldmadmin
plotmetrics" facilities. See the documentation for details.
If you can increase the amount of physical memory to, say, 120% of the
product-queue size, that would make the system more efficient. Your queue
currently holds about three hours of data (an "age" of 11004 seconds) in 123
GB -- roughly 37 GB per hour -- so if saving the last hour's worth of data is
sufficient, you would need approximately 44 GB of physical memory.
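For reference, the back-of-the-envelope arithmetic, using the pqmon numbers
above (a sketch; the small difference from the 44 GB figure is GB-vs-GiB
rounding):

    # bytes in queue / age in seconds * one hour, plus 20% headroom
    echo "scale=1; 123399368448 / 11004 * 3600 / 2^30 * 1.2" | bc
    # prints approximately 45 (GiB)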
> I see a lot of these in the logs, but nothing else that stands out to me.
> 20190324T000001.404224Z pqact[22051] WARN filel.c:3016:reap() Child 10270
> terminated by signal 10
The above means that the child process that was started by a pqact(1) EXEC
entry, and whose process ID was 10270, received a SIGUSR1 signal (signal 10 on
most Linux systems) and consequently terminated. The LDM system uses this
signal to make its various processes close and re-open their log files, which
is necessary in order to change to a new log file. Unfortunately, this
particular child process handled the SIGUSR1 in the default manner: by
terminating abnormally.
Can you determine what program corresponded to PID 10270?
I consider this a bug in the LDM system and will work on a fix for the next
release. Thanks for reporting it.
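In the meantime, if you can't modify the program itself, one workaround is to
run it from a small wrapper script in the pqact(1) EXEC entry that ignores
SIGUSR1 first -- a sketch, where "/path/to/real/program" is a hypothetical
placeholder:

    #!/bin/sh
    # Hypothetical wrapper for a pqact(1) EXEC entry.  An ignored signal
    # disposition survives exec(2), so the real program starts with
    # SIGUSR1 ignored and won't be killed by the LDM's log-rotation signal.
    trap '' USR1
    exec /path/to/real/program "$@"

This works because an ignored disposition, unlike an installed handler, is
inherited across exec().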
> I do see these as well, but I’m not sure this is tied to the issue:
> * 20190325T022401.340805Z XXX.XXX.XXX.XXX(feed)[8890] NOTE
> error.c:236:err_log() Couldn't flush connection; flushConnection() failure to
> 199.133.78.4: RPC: Unable to receive; errno = Connection reset by peer
> * 20190325T022603.051454Z XXX.XXX.XXX.XXX(feed)[2789] NOTE
> uldb.c:1535:sm_vetUpstreamLdm() Terminated redundant upstream LDM
> (addr=199.133.78.4, pid=21698, vers=6, type=feeder, mode=alternate,
> sub=(20190325012401.287302 TS_ENDT {{EXP, ".*"}}))
> * 20190325T022603.051555Z XXX.XXX.XXX.XXX(feed)[21698] NOTE
> ldmd.c:306:signal_handler() SIGTERM received
> * 20190325T022603.051605Z XXX.XXX.XXX.XXX(feed)[21698] NOTE
> ldmd.c:187:cleanup() Exiting
> * 20190325T022603.052320Z ldmd[22048] NOTE ldmd.c:170:reap() child 21698
> exited with status 7
The above means that a receiving LDM process on host XXX.XXX.XXX.XXX subscribed
to the same feed as a previous receiving LDM process on the same host.
Consequently, the upstream LDM process serving the new subscription terminated
the upstream LDM process that was serving the previous one, because 1) there's
no sense in duplicating work, and 2) a flood of duplicate subscriptions is a
classic denial-of-service vector.
This can be safely ignored -- unless the two receiving LDM processes are
actually on different hosts behind a NAT, in which case they appear to the
sender to have the same IP address and the termination is a false positive. In
that case, the registry parameter "/server/enable-anti-DOS" at the sending site
should be set to "false".
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: WSJ-190258
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata
inquiry tracking system and then made publicly available through the web. If
you do not want to have your interactions made available in this way, you must
let us know in each email you send to us.