20011204: LDM: pqbinstats & system crash (fwd)

This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.

Subject: 20011204: LDM: pqbinstats & system crash (fwd)
Date: Tue, 4 Dec 2001 10:10:51 -0700 (MST)


===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Mon, 03 Dec 2001 21:52:08 -0700
From: Unidata Support <address@hidden>
To: address@hidden
     address@hidden
Subject: 20011204: LDM: pqbinstats & system crash


------- Forwarded Message

>To: <address@hidden>
>From: Tom McDermott <address@hidden>
>Subject: LDM: pqbinstats & system crash
>Organization: UCAR/Unidata
>Keywords: 200111301311.fAUDBmN10047

Hi,

I don't think this is the sort of problem that lends itself to a solution,
but I thought I would report it anyway.  My server is a Sun SparcStation 10
running Solaris 7 and LDM 5.1.4.  At 3:34 AM EST today the system panicked,
apparently triggered by the pqbinstats program.  From the
system log:

Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0 rw=0
Nov 30 03:34:44 vortex unix: pqbinstats:
Nov 30 03:34:44 vortex unix: Illegal instruction
Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810, psr=0x408010c5, context=144
Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80, 1, f733e9a0
Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878, args=f8ba76d8 fc099a38 fc099b80 fc099ee0 fc099b80 0
Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8, args=fc0999c0 fc099a38 f8ba76d8 0 0 4000000
Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0 efffec70 0 0 0 1f22c
Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520 198 0 3c06afe0 66
Nov 30 03:34:44 vortex unix: End traceback...
Nov 30 03:34:46 vortex unix: panic:
Nov 30 03:34:46 vortex unix: Illegal instruction
Nov 30 03:34:46 vortex unix:
Nov 30 03:34:46 vortex unix: syncing file systems...
Nov 30 03:34:46 vortex unix:  18
Nov 30 03:34:46 vortex unix:  5
Nov 30 03:34:46 vortex unix:  4
Nov 30 03:34:46 vortex last message repeated 19 times
Nov 30 03:34:46 vortex unix:  cannot sync -- giving up

Anne,

Did anyone mention that this might be a disk problem? Hopefully the reboot
ran fsck on the file system and fixed the problem. If this happens
again, I would suggest running fsck to check for bad disk blocks. If the
bad blocks are in the LDM distribution, then the LDM should be
reinstalled. Just some thoughts.

Robb...




This by itself wouldn't have been too bad, but as the last message might
lead you to suspect, when the system rebooted, the product queue was
corrupt.  Instead of the LDM shutting down, the rpc.ldmd server and
pqact processes continued to run, and more server processes were spawned as
downstream sites kept trying to connect.  This led to a situation where
the rpc.ldmd processes almost completely chewed up the CPU:

last pid:  7035;  load averages: 94.12, 92.54, 87.27              07:06:25
188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
CPU states:  0.0% idle, 95.7% user,  4.3% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 338M free, 107M swap in use, 1065M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  550 ldm        1  58    0  301M 2696K run     15:55  1.75% pqsurf
 5735 ldm        1  59    0  293M 2692K run      1:44  1.64% rpc.ldmd
 5436 ldm        1  59    0  293M 2672K run      2:37  1.44% rpc.ldmd
 5076 ldm        1  49    0  293M 2672K run      2:33  1.36% rpc.ldmd
  552 ldm        1  49    0  293M 2328K run     15:42  1.34% rpc.ldmd
  549 ldm        1  49    0  293M 3280K run     16:11  1.31% pqact
 4989 ldm        1  59    0  293M 2672K run      2:55  1.27% rpc.ldmd
 1780 ldm        1  59    0  293M 2684K run      6:56  1.23% rpc.ldmd
 1419 ldm        1  58    0  293M 2672K run      8:50  1.22% rpc.ldmd
 4487 ldm        1  59    0  293M 2672K run      3:46  1.17% rpc.ldmd
 6188 ldm        1  59    0  293M 2684K run      0:55  1.16% rpc.ldmd
 2542 ldm        1  59    0  293M 2680K run      5:50  1.14% rpc.ldmd
 1049 ldm        1  49    0  293M 2692K run     11:06  1.13% rpc.ldmd
 4802 ldm        1  59    0  293M 2680K run      3:24  1.12% rpc.ldmd
 5892 ldm        1  54    0  293M 2672K run      1:09  1.11% rpc.ldmd
 6827 ldm        1  49    0  293M 2676K run      0:07  1.11% rpc.ldmd
 5159 ldm        1  49    0  293M 2672K run      2:22  1.10% rpc.ldmd
 6420 ldm        1  59    0  293M 2680K run      0:38  1.10% rpc.ldmd
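
A listing like the one above is easier to read once it is tallied per command. The following is only a minimal sketch, assuming the column layout shown in this top output (eleven whitespace-separated fields, CPU in the tenth, command in the eleventh); the function name tally_by_command is hypothetical and not part of the LDM or of top.

```python
from collections import defaultdict


def tally_by_command(top_text):
    """Return {command: (process_count, summed_cpu_percent)} for top-style rows.

    Assumes rows look like:
      PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
    Header and summary lines are skipped because their first field
    is not a numeric PID or they do not have eleven fields.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for row in top_text.splitlines():
        fields = row.split()
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        cpu_percent = float(fields[9].rstrip("%"))
        entry = totals[fields[10]]
        entry[0] += 1
        entry[1] += cpu_percent
    return {cmd: (count, round(cpu, 2)) for cmd, (count, cpu) in totals.items()}


if __name__ == "__main__":
    # Three rows copied from the listing above.
    sample = """\
  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  550 ldm        1  58    0  301M 2696K run     15:55  1.75% pqsurf
 5735 ldm        1  59    0  293M 2692K run      1:44  1.64% rpc.ldmd
 5436 ldm        1  59    0  293M 2672K run      2:37  1.44% rpc.ldmd
"""
    for command, (count, cpu) in sorted(tally_by_command(sample).items()):
        print(command, count, cpu)
```

Feeding it the full listing would show how many rpc.ldmd processes were competing for the CPU at once, which is the symptom described here.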

But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
work), I remade the queues and all is now well again.
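
The recovery described above (stop attempt, manual kill, remake the queue, restart) could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure that was actually run: it assumes ldmadmin is on the PATH and supports the delqueue and mkqueue subcommands in your LDM version, that pkill is available (it is on Solaris 7 and later), and that the LDM runs as user ldm. The names RECOVERY_STEPS and run_recovery are hypothetical. It only handles the product queue, and it defaults to a dry run that prints the commands without executing them.

```python
import subprocess

# Hypothetical recovery sequence, in the order described in the message.
# Command names are assumptions about a stock LDM install; adjust for your site.
RECOVERY_STEPS = [
    ["ldmadmin", "stop"],                # may fail, as it did in this report
    ["pkill", "-u", "ldm", "rpc.ldmd"],  # kill leftover server processes
    ["ldmadmin", "delqueue"],            # remove the corrupt product queue
    ["ldmadmin", "mkqueue"],             # create a fresh product queue
    ["ldmadmin", "start"],
]


def run_recovery(steps=RECOVERY_STEPS, dry_run=True):
    """Run each step in order and return the command lines attempted."""
    attempted = []
    for cmd in steps:
        line = " ".join(cmd)
        attempted.append(line)
        if dry_run:
            print("would run:", line)
            continue
        # check=False: keep going even if a step exits non-zero, since
        # `ldmadmin stop` can fail and pkill returns 1 when nothing matches.
        subprocess.run(cmd, check=False)
    return attempted


if __name__ == "__main__":
    run_recovery(dry_run=True)
```

Review the printed commands first, then call run_recovery(dry_run=False) as the LDM user once they match your installation.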

Tom
-----------------------------------------------------------------------------
Tom McDermott                           Email: address@hidden
Systems Administrator                   Phone: (716) 395-5718
Earth Sciences Dept.                    Fax: (716) 395-2416
SUNY College at Brockport



------- End of Forwarded Message
