19991123: ldm problems after reboot at St. Cloud State
- Subject: 19991123: ldm problems after reboot at St. Cloud State
- Date: Tue, 23 Nov 1999 16:28:24 -0700
>From: alan anderson <address@hidden>
>Organization: St. Cloud State
>Keywords: 199911232244.PAA17363 LDM
Alan,
>We are having a problem with our ldm machine. I checked it today
>and found it in a state that looked to me like it had been shutdown
>and rebooted; a ps indicated no ldm processes running.
>The ldm does not start automatically upon a reboot.
Does not start, or did not start?
>I did an ldmadmin stop just to be sure, and then rebooted. System
>came back up with no problems or messages.
>
>Tried to start the ldm, but the start did not confirm; I recalled that
>queue often is corrupted, so deleted the queue, then a mkqueue, then
>restarted ldm, which was confirmed.
Did you make sure to become the user 'ldm' before trying to restart
the LDM? This is important: the LDM should never be run as 'root'.
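One way to guard against this is to check the current user before
running any LDM command. The following is only a sketch: the function
name require_user is hypothetical (it is not part of the LDM
distribution), and it assumes an 'id' command that accepts '-un'.

```shell
# Hypothetical helper, not part of the LDM: succeed only if the
# current user matches the expected one.
#   $1 = expected user name
#   $2 = user name to check (optional; defaults to `id -un`)
require_user() {
    expected=$1
    actual=${2:-$(id -un)}
    if [ "$actual" != "$expected" ]; then
        echo "refusing to continue: running as '$actual', expected '$expected'" >&2
        return 1
    fi
    return 0
}

# Example use (commented out so that sourcing this file does nothing):
# require_user ldm && ldmadmin start
```

With this in place, 'ldmadmin start' is only reached when the check
succeeds, so an accidental start from a root shell is refused.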
>Log files (excerpt below) show that something is still wrong. My shallow
>memory about what this could be leaves me blank, so I am writing to you.
No problem.
>waldo is the place where ldm lives; I think you already know the rest
>if you want to look around. Otherwise, could I have some instructions?
I decided to login and do some snooping. More below.
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: pipe_prodput: trying again
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: child 5357 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: child 5355 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: pipe_prodput: trying again
>Nov 23 22:27:58 waldo pqact[1514]: pbuf_flush (4) write: Broken pipe
>Nov 23 22:27:58 waldo pqact[1514]: pipe_dbufput: xcd_runDDS write error
>Nov 23 22:27:58 waldo pqact[1514]: child 5361 exited with status 127
>Nov 23 22:27:58 waldo pqact[1514]: child 5359 exited with status 127
The repeated start and failure of 'xcd_run DDS' is telling us that the
process that xcd_run is running (ingetext.k in this case) is exiting
without reading from the LDM. This is a big hint that the LDM was
most likely not started by the user 'ldm', since things were working
correctly before.
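The "exited with status 127" lines in the log excerpt may offer a
further clue. In a POSIX shell, status 127 conventionally means that
the command could not be found, which would be consistent with the
decoders being launched from an environment (such as root's) whose
PATH does not include them. That reading is an assumption, not
something confirmed on waldo. A minimal sketch showing the convention,
using a hypothetical helper named status_of and a deliberately
nonexistent command name:

```shell
# Hypothetical helper: run a command line in a child shell, discard
# its output, and print the exit status it finished with.
status_of() {
    sh -c "$1" >/dev/null 2>&1
    echo $?
}

# Example use (commented out so that sourcing this file does nothing):
# status_of 'exit 0'
# status_of 'hypothetical_missing_command_xyz'
```

The second call reports the status a shell returns when it cannot
find the command it was asked to run.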
I did a quick look around and found that you must have started the LDM
as 'root' as a number of files were owned by root:
/usr/local/ldm% ls -al
total 10980
drwxr-xr-x 16 ldm data 1024 Nov 23 22:07 ./
-rw-rw-r-- 1 root other 5 Nov 23 22:07 ldmd.pid
waldo# ls -al
total 3854
drwxr-xr-x 2 ldm other 1024 Nov 23 22:48 .
drwxr-xr-x 7 ldm data 512 Nov 23 22:06 ..
-rw-rw-r-- 1 ldm data 446 Nov 21 19:03 1999112118.stats
-rw-rw-r-- 1 ldm data 446 Nov 21 20:03 1999112119.stats
-rw-rw-r-- 1 ldm data 446 Nov 21 21:04 1999112120.stats
-rw-rw-r-- 1 ldm data 446 Nov 21 22:09 1999112121.stats
-rw-rw-r-- 1 ldm data 446 Nov 21 23:03 1999112122.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 00:03 1999112123.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 01:03 1999112200.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 02:04 1999112201.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 03:05 1999112202.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 04:08 1999112203.stats
-rw-rw-r-- 1 ldm data 110 Nov 22 04:45 1999112204.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 06:27 1999112205.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 07:03 1999112206.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 08:22 1999112207.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 09:03 1999112208.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 10:03 1999112209.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 11:03 1999112210.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 12:03 1999112211.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 13:03 1999112212.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 14:03 1999112213.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 15:59 1999112214.stats
-rw-rw-r-- 1 ldm data 446 Nov 22 16:28 1999112215.stats
-rw-rw-r-- 1 root other 449 Nov 23 21:59 1999112320.stats
-rw-rw-r-- 1 root other 559 Nov 23 22:49 1999112321.stats
-rw-rw-r-- 1 root other 559 Nov 23 23:07 1999112322.stats
-rw-rw-r-- 1 ldm data 301 Apr 23 1999 f.log
-rw-r--r-- 1 ldm data 1146 Nov 23 22:35 ldmbinstats.upc
-rw-rw-r-- 1 root other 1587076 Nov 23 23:07 ldmd.log
-rw-rw-r-- 1 root other 92333 Nov 23 21:59 ldmd.log.1
-rw-r--r-- 1 ldm data 3938 Nov 23 21:49 ldmd.log.2
-rw-r--r-- 1 ldm data 87387 Nov 22 16:27 ldmd.log.3
-rw-r--r-- 1 ldm data 142790 Nov 21 23:58 ldmd.log.4
-rw-r--r-- 1 ldm data 0 Mar 29 1999 ldmfail
-rw-rw-r-- 1 ldm data 3591 Apr 23 1999 netcheck.log
I corrected this by becoming 'root' and changing the ownership of all
files owned by 'root' in the ~ldm directory tree. This included
~ldm/data/ldmd.pq, the LDM product queue:
waldo# chown ldm *
waldo# chgrp data *
waldo# cd ~ldm/logs
waldo# chown ldm *
waldo# chgrp data *
waldo# cd ~ldm/data
waldo# chown ldm *
waldo# chgrp data *
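Rather than inspecting each directory by hand, the files with the
wrong owner can be listed in one pass with find. This is a sketch
only: the function name list_not_owned_by is hypothetical, and it
simply prints paths without changing anything.

```shell
# Hypothetical helper: print every path under a directory that is NOT
# owned by the given user (name or numeric uid).
#   $1 = directory to search
#   $2 = user who is expected to own everything
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example use (commented out so that sourcing this file does nothing):
# list_not_owned_by /usr/local/ldm ldm
```

An empty result means everything under the directory already belongs
to the expected user. Unlike 'chown ldm *', which only touches entries
in the current directory, find descends into subdirectories as well.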
Next, I tried starting the LDM as 'ldm', but I couldn't since the
hidden LDM lock file in /tmp was still owned by 'root'. So, I became
root again and stopped the LDM:
su -
<password>
exec csh
setenv PATH ~ldm/bin:$PATH
ldmadmin stop
exit
After this I was back to being the user 'ldm'. For good measure, I did
an 'ldmadmin stop' and then started the LDM:
ldmadmin stop
ldmadmin start
ldmadmin tail
/usr/local/ldm/logs% ldmadmin tail
Nov 23 23:17:04 waldo chinook[17562]: run_requester: 19991123222238.441 TS_ENDT
{{FSL2|MCIDAS|IDS|DDPLUS, ".*"}}
Nov 23 23:17:04 waldo chinook[17562]: FEEDME(chinook.unl.edu): OK
Nov 23 23:17:05 waldo udp.ldmd[17566]: Starting Up
Nov 23 23:17:06 waldo localhost[17590]: Connection from localhost
Nov 23 23:17:06 waldo localhost[17590]: Connection reset by peer
Nov 23 23:17:06 waldo localhost[17590]: Exiting
Nov 23 23:17:45 waldo proftomd[17596]: Starting up
Nov 23 23:17:46 waldo proftomd[17596]: Making /var/data/mcidas/MDXX0097; may
take some time...
Nov 23 23:17:49 waldo proftomd[17596]: Decoding 1999327.2212 data into
/var/data/mcidas/MDXX0097
Nov 23 23:17:49 waldo proftomd[17596]: Exiting
Nov 23 23:21:00 waldo lwtoa3[17606]: PRODUCT CODE=UX 99327 223019
Nov 23 23:21:00 waldo lwtoa3[17606]: Done -- AREA= 109
Nov 23 23:21:06 waldo pqact[17558]: pbuf_flush (6) write: Broken pipe
Nov 23 23:21:06 waldo pqact[17558]: pbuf_flush 6: time elapsed 5.351715
Nov 23 23:21:06 waldo pqact[17558]: pipe_dbufput:
-closelwtoa3-d/var/data/mcidas write error
Nov 23 23:21:06 waldo pqact[17558]: pipe_prodput: trying again
Nov 23 23:21:06 waldo lwtoa3[17622]: PRODUCT CODE=UX 99327 223019
Nov 23 23:21:06 waldo lwtoa3[17622]: Done -- AREA= 100
Nov 23 23:21:10 waldo pqact[17558]: pbuf_flush (6) write: Broken pipe
Nov 23 23:21:10 waldo pqact[17558]: pbuf_flush 6: time elapsed 4.002119
Nov 23 23:21:10 waldo pqact[17558]: pipe_dbufput:
-closelwtoa3-d/var/data/mcidas write error
Nov 23 23:22:04 waldo pqexpire[17555]: > Recycled 27588.838 kb/hr ( 3567.163
prods per hour)
Nov 23 23:22:39 waldo lwtoa3[17642]: PRODUCT CODE=UA 99327 223134
Nov 23 23:22:41 waldo lwtoa3[17642]: Done -- AREA= 167
Nov 23 23:27:04 waldo pqexpire[17555]: > Recycled 19132.159 kb/hr ( 2987.324
prods per hour)
The 'pbuf_flush (6) write: Broken pipe' error seemed to be telling me
that an lwtoa3 process was failing to write to an AREA file in
/var/data/mcidas, but I looked there and all files were owned by 'ldm'.
The success of AREA0167 further told me that things seemed to be
working correctly, so I decided to let things run and see what
happens.
Please let me know if you see problems with ldm-mcidas or XCD data
decoding.
Tom
>From address@hidden Wed Nov 24 13:05:53 1999
Hi Tom
Just a short note to acknowledge your fix on waldo. I was
not aware of the problems created by having root perform any
system maintenance on the ldm.
Our system seems to be working fine again.
Thanks
alan