20050214: LDM product queue corruption
- Subject: 20050214: LDM product queue corruption
- Date: Mon, 14 Feb 2005 14:08:23 -0700
Gabe,
>Date: Mon, 14 Feb 2005 15:21:52 -0500 (EST)
>From: Gabe Langbauer <address@hidden>
>Organization: Ohio State University
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20050214: LDM product queue corruption
The above message contained the following:
> The original log is attached, note there is no ldmping issue on this log,
> it seems to die with a rpc.ldmd error...and there is a mention of rtstats.
> I don't know if those are the stats from "do stats". Every subsequent
> time I issued the start command I got this log (although times were
> different):
>
> Feb 12 23:24:21 twister ldmping[10477]: SVC_UNAVAIL 0.000601 0 localhost RPC: Program not registered
> Feb 12 23:24:21 twister pqcheck[10481]: Starting Up (10472)
> Feb 12 23:24:21 twister pqcheck[10481]: The writer-counter of the product-queue is 0
> Feb 12 23:24:21 twister pqcheck[10481]: Exiting
The above are OK. The "ldmping" entry is from the ldmadmin(1) script
testing to see if an LDM is already running. The pqcheck(1) entries are
from the same script checking to see that the product-queue is OK.
> I agree, mighty suspicious indeed. Logs above
The end of the logfile contained this:
Feb 12 22:58:54 twister rpc.ldmd[791]: child 793 terminated by signal 25
Process 793 was a pqact(1) process:
$ fgrep '[793]' ldmd.log.4
Feb 12 07:02:16 twister pqact[793]: child 569 exited with status 1
Feb 12 07:58:21 twister pqact[793]: child 16497 exited with status 1
Feb 12 21:12:23 twister pqact[793]: child 11341 exited with status 1
Feb 12 22:30:00 twister pqact[793]: pbuf_flush (3) write: Broken pipe
and was, undoubtedly, started via an EXEC entry in the LDM
configuration-file, etc/ldmd.conf.
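The lookup above (from the rpc.ldmd termination entry to the entries logged by the child itself) can be sketched as a small script. This is only an illustration: the sample lines are copied from the excerpts in this message, and a temporary file stands in for the real logfile.

```shell
# Sample lines copied from the log excerpts in this message;
# a temporary file stands in for the real LDM logfile.
log=$(mktemp)
cat > "$log" <<'EOF'
Feb 12 07:02:16 twister pqact[793]: child 569 exited with status 1
Feb 12 22:30:00 twister pqact[793]: pbuf_flush (3) write: Broken pipe
Feb 12 22:58:54 twister rpc.ldmd[791]: child 793 terminated by signal 25
EOF

# Pull the child's PID out of the last "terminated by signal" entry.
pid=$(sed -n 's/.*child \([0-9][0-9]*\) terminated by signal.*/\1/p' "$log" | tail -n 1)

# List every entry that the process logged under its own PID.
matches=$(grep -F "[$pid]" "$log")
echo "$matches"

rm -f "$log"
```

Against a real logfile, only the sed and grep steps would be needed.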
The LDM server exits when an EXEC-ed child process terminates abnormally
due to a seriously bad signal (e.g., SIGSEGV).
Oddly, on my system, signal 25 is SIGCONT and should not cause the
pqact(1) process to terminate. What is it on your system?
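One way to answer that, assuming a POSIX-style shell, is to have the shell's kill builtin translate the number into a name. Signal numbering differs between operating systems, so the result depends on the host where it is run.

```shell
# Translate signal number 25 into its name on this host.
# The mapping is OS-dependent, so no particular output is assumed here.
sig_name=$(kill -l 25)
echo "signal 25 is SIG$sig_name"
```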
One can work around this behavior by wrapping EXEC-ed programs in a
shell-script that ensures that their abnormal termination is never seen
by the LDM, e.g.,
$ cat util/execWrapper
#!/bin/sh
# Rerun the given command whenever it exits, so the LDM never sees it terminate.
while true
do
    "$@"
    logger -p local0.notice "Restarting: $*"
done
(The above is off-the-top-of-my-head and might need modification.)
The relevant EXEC entry is then replaced with
EXEC "execWrapper prog a1 a2"
(assuming the script is in the "util/" subdirectory and is executable).
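If the wrapped program fails immediately every time, an unconditional loop will spin and flood the log. A bounded variant with a pause between runs might look like the sketch below. The function name restart_loop and its two leading arguments (maximum number of runs, delay in seconds) are hypothetical and not part of the LDM; treat this as untested against a real LDM installation.

```shell
# Hypothetical helper, not part of the LDM:
#   restart_loop MAX_RUNS DELAY_SECONDS command [args...]
# Runs the command up to MAX_RUNS times, pausing DELAY_SECONDS after each run.
restart_loop() {
    max=$1
    delay=$2
    shift 2
    runs=0
    while [ "$runs" -lt "$max" ]
    do
        "$@"
        status=$?                 # greater than 128 usually means killed by a signal
        runs=$((runs + 1))
        # Log each exit; ignore failures if logger(1) is unavailable.
        logger -p local0.notice "$* exited with status $status (run $runs of $max)" 2>/dev/null
        sleep "$delay"
    done
}
```

Inside a wrapper script it could be called as, e.g., restart_loop 10 5 "$@", which would run the program at most ten times with five seconds between runs.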
Regards,
Steve Emmerson