This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Unidata Support wrote:
>
> ------- Forwarded Message
>
> >To: <address@hidden>
> >From: Tom McDermott <address@hidden>
> >Subject: LDM: pqbinstats & system crash
> >Organization: UCAR/Unidata
> >Keywords: 200111301311.fAUDBmN10047
>
> Hi,
>
> I don't think this is the sort of problem that lends itself to a solution,
> but I thought I would report anyways. My server is a Sun SparcStation 10
> running Solaris 7 and ldm 5.1.4. At 3:34AM EST today something happened
> which seems to have been triggered by the pqbinstats program. From the
> system log:
>
> Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0
> rw=0
> Nov 30 03:34:44 vortex unix: pqbinstats:
> Nov 30 03:34:44 vortex unix: Illegal instruction
> Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810,
> psr=0x4080
> 10c5, context=144
> Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80,
> 1, f73
> 3e9a0
> Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
> Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878,
> args=f8ba76d8 fc
> 099a38 fc099b80 fc099ee0 fc099b80 0
> Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8,
> args=fc0999c0 fc
> 099a38 f8ba76d8 0 0 4000000
> Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0
> efffec70
> 0 0 0 1f22c
> Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520
> 198 0
> 3c06afe0 66
> Nov 30 03:34:44 vortex unix: End traceback...
> Nov 30 03:34:46 vortex unix: panic:
> Nov 30 03:34:46 vortex unix: Illegal instruction
> Nov 30 03:34:46 vortex unix:
> Nov 30 03:34:46 vortex unix: syncing file systems...
> Nov 30 03:34:46 vortex unix: 18
> Nov 30 03:34:46 vortex unix: 5
> Nov 30 03:34:46 vortex unix: 4
> Nov 30 03:34:46 vortex last message repeated 19 times
> Nov 30 03:34:46 vortex unix: cannot sync -- giving up
>
> This by itself wouldn't have been too bad, but as the last message might
> lead you to suspect, when the system rebooted, the product queue was
> corrupt. But instead of the ldm system stopping, the rpc.ldmd server and
> pqact processes continued to run and more server processes were spawned as
> downstream sites kept trying to connect. This led to a situation where
> the rpc.ldmd processes almost completely chewed up the CPU:
>
> last pid: 7035; load averages: 94.12, 92.54, 87.27 07:06:25
> 188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
> CPU states: 0.0% idle, 95.7% user, 4.3% kernel, 0.0% iowait, 0.0% swap
> Memory: 512M real, 338M free, 107M swap in use, 1065M swap free
>
>   PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
>   550 ldm        1  58    0  301M 2696K run    15:55  1.75% pqsurf
>  5735 ldm        1  59    0  293M 2692K run     1:44  1.64% rpc.ldmd
>  5436 ldm        1  59    0  293M 2672K run     2:37  1.44% rpc.ldmd
>  5076 ldm        1  49    0  293M 2672K run     2:33  1.36% rpc.ldmd
>   552 ldm        1  49    0  293M 2328K run    15:42  1.34% rpc.ldmd
>   549 ldm        1  49    0  293M 3280K run    16:11  1.31% pqact
>  4989 ldm        1  59    0  293M 2672K run     2:55  1.27% rpc.ldmd
>  1780 ldm        1  59    0  293M 2684K run     6:56  1.23% rpc.ldmd
>  1419 ldm        1  58    0  293M 2672K run     8:50  1.22% rpc.ldmd
>  4487 ldm        1  59    0  293M 2672K run     3:46  1.17% rpc.ldmd
>  6188 ldm        1  59    0  293M 2684K run     0:55  1.16% rpc.ldmd
>  2542 ldm        1  59    0  293M 2680K run     5:50  1.14% rpc.ldmd
>  1049 ldm        1  49    0  293M 2692K run    11:06  1.13% rpc.ldmd
>  4802 ldm        1  59    0  293M 2680K run     3:24  1.12% rpc.ldmd
>  5892 ldm        1  54    0  293M 2672K run     1:09  1.11% rpc.ldmd
>  6827 ldm        1  49    0  293M 2676K run     0:07  1.11% rpc.ldmd
>  5159 ldm        1  49    0  293M 2672K run     2:22  1.10% rpc.ldmd
>  6420 ldm        1  59    0  293M 2680K run     0:38  1.10% rpc.ldmd
>
> But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
> work), I remade the queues and all is now well again.
>
> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                 Email: address@hidden
> Systems Administrator         Phone: (716) 395-5718
> Earth Sciences Dept.          Fax: (716) 395-2416
> SUNY College at Brockport
>
> ------- End of Forwarded Message

Hi there, Tom,

In two years I have not heard of pqbinstats crashing. If you have a core
file we can see where it crashed and what it was doing, which may or may
not lead us to a conclusion about why it happened.

One possibility is a bad disk block. It could be that the reboot detected
and repaired that - would that appear in your logs? If this happens again,
you could run fsck to scan for bad blocks. If there are bad blocks
underneath the ldm installation, a reinstallation would be prudent.

I believe pqbinstats reads the queue, so that might explain the queue
corruption.

It is not uncommon to see the runaway rpc.ldmd processes once the ldm gets
in such a confused state. At that point, killing them by hand like you did
may be the only option.

Anne
--
***************************************************
Anne Wilson                   UCAR Unidata Program
address@hidden                P.O. Box 3000
                              Boulder, CO 80307
----------------------------------------------------
Unidata WWW server            http://www.unidata.ucar.edu/
****************************************************
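
Editorial sketch (not part of the original exchange): the reply above notes that killing runaway rpc.ldmd processes by hand may be the only option once ldmadmin stop no longer works. The Python below illustrates one way to collect the PIDs involved. It is a minimal sketch under stated assumptions, not an LDM tool: the function names find_pids, list_processes, and terminate are hypothetical, the process listing is assumed to come from the POSIX `ps -eo pid,user,comm` form, and the account name `ldm` is taken from the top output in the forwarded message.

```python
import os
import signal
import subprocess


def find_pids(ps_output, command="rpc.ldmd", user="ldm"):
    """Return PIDs from `ps -eo pid,user,comm` text that match command and user.

    Hypothetical helper for illustration; it only parses text and kills nothing.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 2)
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skip blank or malformed rows
        pid, owner, comm = fields
        # comm may be a bare name or a full path, so compare the basename
        if owner == user and os.path.basename(comm.strip()) == command:
            pids.append(int(pid))
    return pids


def list_processes():
    """Run ps and return its output (assumes a POSIX-style ps is on PATH)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,user,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def terminate(pids, sig=signal.SIGTERM):
    """Send sig to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

A cautious way to use this would be to print `find_pids(list_processes())` first, compare the result against top or ps by eye, and only then call `terminate` on the list. The sketch defaults to SIGTERM so that processes get a chance to exit cleanly; whether a stronger signal is needed depends on the state the LDM is in.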