20011204: LDM: pqbinstats & system crash (fwd)
- Subject: 20011204: LDM: pqbinstats & system crash (fwd)
- Date: Tue, 4 Dec 2001 10:10:51 -0700 (MST)
===============================================================================
Robb Kambic Unidata Program Center
Software Engineer III Univ. Corp for Atmospheric Research
address@hidden WWW: http://www.unidata.ucar.edu/
===============================================================================
---------- Forwarded message ----------
Date: Mon, 03 Dec 2001 21:52:08 -0700
From: Unidata Support <address@hidden>
To: address@hidden
address@hidden
Subject: 20011204: LDM: pqbinstats & system crash
------- Forwarded Message
>To: <address@hidden>
>From: Tom McDermott <address@hidden>
>Subject: LDM: pqbinstats & system crash
>Organization: UCAR/Unidata
>Keywords: 200111301311.fAUDBmN10047
Hi,
I don't think this is the sort of problem that lends itself to a solution,
but I thought I would report it anyway. My server is a Sun SparcStation 10
running Solaris 7 and ldm 5.1.4. At 3:34AM EST today something happened
which seems to have been triggered by the pqbinstats program. From the
system log:
Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0 rw=0
Nov 30 03:34:44 vortex unix: pqbinstats:
Nov 30 03:34:44 vortex unix: Illegal instruction
Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810, psr=0x408010c5, context=144
Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80, 1, f733e9a0
Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878, args=f8ba76d8 fc099a38 fc099b80 fc099ee0 fc099b80 0
Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8, args=fc0999c0 fc099a38 f8ba76d8 0 0 4000000
Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0 efffec70 0 0 0 1f22c
Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520 198 0 3c06afe0 66
Nov 30 03:34:44 vortex unix: End traceback...
Nov 30 03:34:46 vortex unix: panic:
Nov 30 03:34:46 vortex unix: Illegal instruction
Nov 30 03:34:46 vortex unix:
Nov 30 03:34:46 vortex unix: syncing file systems...
Nov 30 03:34:46 vortex unix: 18
Nov 30 03:34:46 vortex unix: 5
Nov 30 03:34:46 vortex unix: 4
Nov 30 03:34:46 vortex last message repeated 19 times
Nov 30 03:34:46 vortex unix: cannot sync -- giving up
Anne,
Did anyone mention that this might be a disk problem? Hopefully the reboot
triggered an fsck of the file system and that fixed the problem. If this
happens again, I would suggest running fsck to check for bad disk blocks. If
the bad blocks occurred in the LDM distribution, then the LDM should be
reinstalled. Just some thoughts.
Robb...
This by itself wouldn't have been too bad, but as the last message might
lead you to suspect, when the system rebooted, the product queue was
corrupt. But instead of the ldm system stopping, the rpc.ldmd server and
pqact processes continued to run and more server processes were spawned as
downstream sites kept trying to connect. This led to a situation where
the rpc.ldmd processes almost completely chewed up the CPU:
last pid: 7035; load averages: 94.12, 92.54, 87.27 07:06:25
188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
CPU states: 0.0% idle, 95.7% user, 4.3% kernel, 0.0% iowait, 0.0% swap
Memory: 512M real, 338M free, 107M swap in use, 1065M swap free
  PID USERNAME THR PRI NICE  SIZE   RES STATE  TIME    CPU COMMAND
  550 ldm        1  58    0  301M 2696K run   15:55  1.75% pqsurf
 5735 ldm        1  59    0  293M 2692K run    1:44  1.64% rpc.ldmd
 5436 ldm        1  59    0  293M 2672K run    2:37  1.44% rpc.ldmd
 5076 ldm        1  49    0  293M 2672K run    2:33  1.36% rpc.ldmd
  552 ldm        1  49    0  293M 2328K run   15:42  1.34% rpc.ldmd
  549 ldm        1  49    0  293M 3280K run   16:11  1.31% pqact
 4989 ldm        1  59    0  293M 2672K run    2:55  1.27% rpc.ldmd
 1780 ldm        1  59    0  293M 2684K run    6:56  1.23% rpc.ldmd
 1419 ldm        1  58    0  293M 2672K run    8:50  1.22% rpc.ldmd
 4487 ldm        1  59    0  293M 2672K run    3:46  1.17% rpc.ldmd
 6188 ldm        1  59    0  293M 2684K run    0:55  1.16% rpc.ldmd
 2542 ldm        1  59    0  293M 2680K run    5:50  1.14% rpc.ldmd
 1049 ldm        1  49    0  293M 2692K run   11:06  1.13% rpc.ldmd
 4802 ldm        1  59    0  293M 2680K run    3:24  1.12% rpc.ldmd
 5892 ldm        1  54    0  293M 2672K run    1:09  1.11% rpc.ldmd
 6827 ldm        1  49    0  293M 2676K run    0:07  1.11% rpc.ldmd
 5159 ldm        1  49    0  293M 2672K run    2:22  1.10% rpc.ldmd
 6420 ldm        1  59    0  293M 2680K run    0:38  1.10% rpc.ldmd
But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
work), I remade the queues and all is now well again.
Tom
-----------------------------------------------------------------------------
Tom McDermott Email: address@hidden
Systems Administrator Phone: (716) 395-5718
Earth Sciences Dept. Fax: (716) 395-2416
SUNY College at Brockport
------- End of Forwarded Message
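
The runaway described in the message shows up in the top listing as a pile-up
of rpc.ldmd rows. As a rough sketch only (this is not part of the LDM, and
neither correspondent ran it), a small Python helper could tally process rows
per command for the ldm user, so that a script can spot such a pile-up. The
function name count_commands, the assumed 11-column row layout, and the
shortened sample listing are all illustrative assumptions.

```python
from collections import Counter


def count_commands(top_lines, user="ldm"):
    """Tally process rows per command for one user in top-style output.

    Assumes the 11-column layout seen in the listing:
    PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
    """
    counts = Counter()
    for line in top_lines:
        fields = line.split()
        # Skip the summary and header lines: a process row has at least
        # 11 fields, a numeric PID first, and the user name second.
        if len(fields) >= 11 and fields[0].isdigit() and fields[1] == user:
            counts[fields[-1]] += 1
    return counts


# Shortened sample taken from the listing in the message (hypothetical input).
sample = """\
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
550 ldm 1 58 0 301M 2696K run 15:55 1.75% pqsurf
5735 ldm 1 59 0 293M 2692K run 1:44 1.64% rpc.ldmd
552 ldm 1 49 0 293M 2328K run 15:42 1.34% rpc.ldmd
549 ldm 1 49 0 293M 3280K run 16:11 1.31% pqact
""".splitlines()

counts = count_commands(sample)
print(counts.most_common())
```

What count of rpc.ldmd processes is normal depends on how many downstream and
upstream connections a site has, so any alarm threshold would have to be
chosen per site.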