Re: 20011204: LDM: pqbinstats & system crash
- Subject: Re: 20011204: LDM: pqbinstats & system crash
- Date: Tue, 04 Dec 2001 13:19:35 -0700
Unidata Support wrote:
>
> ------- Forwarded Message
>
> >To: <address@hidden>
> >From: Tom McDermott <address@hidden>
> >Subject: LDM: pqbinstats & system crash
> >Organization: UCAR/Unidata
> >Keywords: 200111301311.fAUDBmN10047
>
> Hi,
>
> I don't think this is the sort of problem that lends itself to a solution,
> but I thought I would report it anyway. My server is a Sun SPARCstation 10
> running Solaris 7 and LDM 5.1.4. At 3:34AM EST today something happened
> which seems to have been triggered by the pqbinstats program. From the
> system log:
>
> Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0 rw=0
> Nov 30 03:34:44 vortex unix: pqbinstats:
> Nov 30 03:34:44 vortex unix: Illegal instruction
> Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810, psr=0x408010c5, context=144
> Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80, 1, f733e9a0
> Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
> Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878, args=f8ba76d8 fc099a38 fc099b80 fc099ee0 fc099b80 0
> Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8, args=fc0999c0 fc099a38 f8ba76d8 0 0 4000000
> Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0 efffec70 0 0 0 1f22c
> Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520 198 0 3c06afe0 66
> Nov 30 03:34:44 vortex unix: End traceback...
> Nov 30 03:34:46 vortex unix: panic:
> Nov 30 03:34:46 vortex unix: Illegal instruction
> Nov 30 03:34:46 vortex unix:
> Nov 30 03:34:46 vortex unix: syncing file systems...
> Nov 30 03:34:46 vortex unix: 18
> Nov 30 03:34:46 vortex unix: 5
> Nov 30 03:34:46 vortex unix: 4
> Nov 30 03:34:46 vortex last message repeated 19 times
> Nov 30 03:34:46 vortex unix: cannot sync -- giving up
>
> This by itself wouldn't have been too bad, but as the last message might
> lead you to suspect, when the system rebooted, the product queue was
> corrupt. But instead of the ldm system stopping, the rpc.ldmd server and
> pqact processes continued to run and more server processes were spawned as
> downstream sites kept trying to connect. This led to a situation where
> the rpc.ldmd processes almost completely chewed up the CPU:
>
> last pid: 7035; load averages: 94.12, 92.54, 87.27 07:06:25
> 188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
> CPU states: 0.0% idle, 95.7% user, 4.3% kernel, 0.0% iowait, 0.0% swap
> Memory: 512M real, 338M free, 107M swap in use, 1065M swap free
>
> PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
> 550 ldm 1 58 0 301M 2696K run 15:55 1.75% pqsurf
> 5735 ldm 1 59 0 293M 2692K run 1:44 1.64% rpc.ldmd
> 5436 ldm 1 59 0 293M 2672K run 2:37 1.44% rpc.ldmd
> 5076 ldm 1 49 0 293M 2672K run 2:33 1.36% rpc.ldmd
> 552 ldm 1 49 0 293M 2328K run 15:42 1.34% rpc.ldmd
> 549 ldm 1 49 0 293M 3280K run 16:11 1.31% pqact
> 4989 ldm 1 59 0 293M 2672K run 2:55 1.27% rpc.ldmd
> 1780 ldm 1 59 0 293M 2684K run 6:56 1.23% rpc.ldmd
> 1419 ldm 1 58 0 293M 2672K run 8:50 1.22% rpc.ldmd
> 4487 ldm 1 59 0 293M 2672K run 3:46 1.17% rpc.ldmd
> 6188 ldm 1 59 0 293M 2684K run 0:55 1.16% rpc.ldmd
> 2542 ldm 1 59 0 293M 2680K run 5:50 1.14% rpc.ldmd
> 1049 ldm 1 49 0 293M 2692K run 11:06 1.13% rpc.ldmd
> 4802 ldm 1 59 0 293M 2680K run 3:24 1.12% rpc.ldmd
> 5892 ldm 1 54 0 293M 2672K run 1:09 1.11% rpc.ldmd
> 6827 ldm 1 49 0 293M 2676K run 0:07 1.11% rpc.ldmd
> 5159 ldm 1 49 0 293M 2672K run 2:22 1.10% rpc.ldmd
> 6420 ldm 1 59 0 293M 2680K run 0:38 1.10% rpc.ldmd
>
> But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
> work), I remade the queues and all is now well again.
>
> Tom
> -----------------------------------------------------------------------------
> Tom McDermott Email: address@hidden
> Systems Administrator Phone: (716) 395-5718
> Earth Sciences Dept. Fax: (716) 395-2416
> SUNY College at Brockport
>
> ------- End of Forwarded Message
Hi there, Tom,

In two years I have not heard of pqbinstats crashing. If you have a
core file we can see where it crashed and what it was doing, which may
or may not lead us to a conclusion about why it happened.

One possibility is a bad disk block. It could be that the reboot
detected and repaired that - would that appear in your logs? If this
happens again, you could run fsck to scan for bad blocks. If there are
bad blocks underneath the ldm installation, a reinstallation would be
prudent.

I believe pqbinstats reads the queue, so that might explain the queue
corruption. It is not uncommon to see the runaway rpc.ldmd processes
once the ldm gets in such a confused state. At that point, killing them
by hand like you did may be the only option.
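
If the manual cleanup ever needs repeating, the "find every rpc.ldmd and
kill it" step could be scripted roughly as below. This is only a sketch,
not part of the LDM distribution: the helper names find_pids and kill_all
are made up for illustration, and it assumes the process list comes from
something shaped like `ps -e -o pid,comm` output (a PID column followed by
a command column).

```python
import os
import signal


def find_pids(ps_output, command="rpc.ldmd"):
    """Return PIDs whose command basename equals `command`.

    Assumes text shaped like `ps -e -o pid,comm` output: one
    "PID COMMAND" pair per line. Lines that do not start with a
    numeric PID (such as the header) are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        pid, comm = fields
        # Compare on the basename so a full path to the binary also matches.
        if os.path.basename(comm.strip()) == command:
            pids.append(int(pid))
    return pids


def kill_all(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID, ignoring processes that already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

One would feed find_pids the captured ps output, review the list of PIDs,
and only then pass it to kill_all as the ldm user. Trying SIGTERM first
and falling back to signal.SIGKILL for anything that survives is the
gentler order. The queue still has to be remade afterwards, as you did.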
Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************