Subject: 20020516: help with ldm (complete message)
Date: Thu, 16 May 2002 13:08:54 -0600
>From: "Jennie L. Moody" <address@hidden>
>Organization: UVa
>Keywords: 200205161823.g4GINca18632 LDM
Hi Jennie,
>Well, it was inevitable that eventually I would have to start
>paying attention to things and trying to fix problems.
Yup. Things were working pretty well a while ago. I worked with
Tony to cut down on the amount of GRIB data that gets decoded
with McIDAS-XCD. This was crucial at the time since windfall
kept running out of disk space (in /p4 ?).
>Our webpage stopped updating yesterday, and it looks like we lost
>our connection to our upstream site. Today I got on to see
>if I could just restart the ldm. After realizing that I
>had to make a new password just to get in (I thought
>I knew the old one?), I found that there is plenty
>I have forgotten. I was thinking I could just
>stop the ldm, ldmadmin stop
>and then restart it
>ldmadmin start.
That is the correct sequence, with the one caveat that you have
to wait until all LDM processes have exited before restarting.
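If you want to script that wait, something along these lines works
(a sketch; the process names are the ones that show up in your 'ps'
listings below):

ldmadmin stop
# poll until no LDM processes are left before restarting
while ps -u ldm | egrep 'rpc.ldmd|pqact|pqbinsta' > /dev/null
do
    sleep 5
done
ldmadmin start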
>But I get the message that there is still a server running:
>
>windfall: /usr/local/ldm/etc $ ps -lf -u ldm
> F S   UID   PID  PPID  C PRI NI     ADDR    SZ    WCHAN    STIME TTY     TIME CMD
> 8 R   ldm   469   467 34  85 22 60d1c1f0 37463          14:10:02 ?     942:46 pqact -d /usr/local/ldm -q /usr/loc
> 8 S   ldm   467     1  0  47 22 608c4010   275 608c4080 14:10:01 ?       0:00 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 O   ldm   470   467 33  75 22 60d1c8b0 37452          14:10:02 ?     928:12 pqbinstats -d /p4/logs -q /usr/loca
> 8 R   ldm   471   467 34  85 22 60d1b470 37472          14:10:02 ?     962:03 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 S   ldm 20640 14588  0  51 20 60d1adb0   204 60d1ae20 13:49:00 pts/1   0:00 -ksh
It is strange to see pqact in the list of still active processes.
Did you check to see if there was still available disk space? More
down below.
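For reference, a quick way to check the space (on windfall, /p4 is
the file system that filled up before):

df -k /p4 /usr/local/ldm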
>windfall: /usr/local/ldm/etc $ whoami
>ldm
>windfall: /usr/local/ldm/etc $ ldmadmin stop
>stopping the LDM server...
>LDM server stopped
>windfall: /usr/local/ldm/etc $ ps -lf -u ldm
> F S   UID   PID  PPID  C PRI NI     ADDR    SZ    WCHAN    STIME TTY     TIME CMD
> 8 R   ldm   469   467 34  85 22 60d1c1f0 37463          14:10:02 ?     942:59 pqact -d /usr/local/ldm -q /usr/loc
> 8 S   ldm   467     1  0  47 22 608c4010   275 608c4080 14:10:01 ?       0:00 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 R   ldm   470   467 32  95 22 60d1c8b0 37452          14:10:02 ?     928:24 pqbinstats -d /p4/logs -q /usr/loca
> 8 O   ldm   471   467 34  75 22 60d1b470 37472          14:10:02 ?     962:16 rpc.ldmd -q /usr/local/ldm/data/ldm
> 8 S   ldm 20640 14588  0  51 20 60d1adb0   204 60d1ae20 13:49:00 pts/1   0:00 -ksh
>
>So this didn't seem to do anything. Using the dumb approach of
>thinking that some of these processes wouldn't stop if delqueue
>wasn't run, I tried that (don't ask me why I thought
>this would work...the mental equivalent of pushing buttons)
At this point, I would forcibly kill all processes that refuse
to die; verify that they are no longer running; delete and remake
the queue; and then restart.
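In outline (this is exactly what I ended up doing below; the PIDs
are whatever 'ps' reports on your system):

ldmadmin stop
ps -u ldm                 # note the PIDs of rpc.ldmd, pqact, pqbinstats
kill -9 <pid> <pid> ...   # only if they survived the plain stop
ps -u ldm                 # verify that they are gone
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start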
>windfall: /usr/local/ldm $ ldmadmin stop
>stopping the LDM server...
>LDM server stopped
>windfall: /usr/local/ldm $ ldmadmin delqueue
>May 16 18:13:20 UTC windfall.evsc.Virginia.EDU : delete_pq: A server is running, cannot delete the queue
Right. The processes that access the queue will have a lock on it,
so you shouldn't be able to delete it.
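If you want to see who is holding the queue open before killing
anything, 'fuser' will list the PIDs (assuming the queue file is the
usual ldm.pq; the path is truncated in the 'ps' output above):

fuser /usr/local/ldm/data/ldm.pq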
>So, I don't know what's up.....sadly, I need a refresher course,
>but in the meantime, maybe someone out there can tell me what to
>do, or jump in here....I will happily share the new access info
>for user ldm...
OK, I just logged on. What I did was:
windfall: /usr/local/ldm $ ps -u ldm
PID TTY TIME CMD
469 ? 966:43 pqact
467 ? 0:00 rpc.ldmd
470 ? 951:45 pqbinsta
471 ? 986:39 rpc.ldmd
25380 pts/8 0:00 ksh
windfall: /usr/local/ldm $ kill -9 469 467 470 471
windfall: /usr/local/ldm $ ldmadmin delqueue
windfall: /usr/local/ldm $ ldmadmin mkqueue
windfall: /usr/local/ldm $ ldmadmin start
windfall: /usr/local/ldm $ ps -u ldm
PID TTY TIME CMD
25487 ? 0:00 ingetext
25452 ? 0:00 ingetext
25467 ? 0:00 startxcd
25485 ? 0:00 ingebin.
25468 ? 0:00 dmsfc.k
25448 ? 0:00 pqbinsta
25446 ? 0:00 startxcd
25470 ? 0:00 dmgrid.k
25449 ? 0:00 rpc.ldmd
25453 ? 0:00 ingebin.
25447 ? 0:00 pqact
25469 ? 0:00 dmraob.k
25445 ? 0:00 rpc.ldmd
25380 pts/8 0:00 ksh
windfall: /usr/local/ldm $ ldmadmin watch
(Type ^D or ^C when finished)
May 16 18:47:26 pqutil:    25724 20020516174856.095 HDS 145  YHWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.200 HDS 150  YTWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.298 HDS 155  YVWB85 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.415 HDS 160  YUWB90 KWBG 161700 /mRUC2
May 16 18:47:26 pqutil:    19326 20020516174856.558 HDS 165  YVWB90 KWBG 161700 /mRUC2
May 16 18:47:27 pqutil:     4821 20020516174856.583 HDS 167  SDUS84 KLZK 161744 /pDPALZK
...
So, windfall is again being fed from ldm.meteo.psu.edu. The 'ldmadmin watch'
shows that products are being received as expected, so LDM-related
things (including McIDAS-XCD decoders) are running. How this relates
to the generation of products for your web page we can't say, but presumably
the page will come back as data gets decoded and cron-initiated scripts run.
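If it doesn't come back on its own, the first thing to check is what
is actually scheduled to run, e.g. as user 'ldm':

crontab -l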
Since things were kinda messed up, I took the opportunity to do some further
cleaning up:
ldmadmin stop
<verify that all LDM processes exit>
cd ~ldm/.mctmp
/bin/rm -rf *
This cleans up subdirectories created by LDM-initiated McIDAS processes.
There were a few left in .mctmp that were fairly old (listing done before
the 'ldmadmin stop' above):
windfall: /usr/local/ldm/.mctmp $ ls -alt
total 46
drwx------ 23 ldm mcidas 512 May 16 14:54 ./
drwx------ 2 ldm mcidas 512 May 16 14:46 2902/
drwx------ 2 ldm mcidas 512 May 16 14:46 3300/
drwx------ 2 ldm mcidas 512 May 16 14:46 3601/
drwxr-xr-x 11 ldm mcidas 1024 May 16 14:46 ../
drwx------ 2 ldm mcidas 512 May 15 13:40 66406/
drwx------ 2 ldm mcidas 512 May 14 07:20 702/
drwx------ 2 ldm mcidas 512 May 14 07:20 801/
drwx------ 2 ldm mcidas 512 May 12 22:10 401/
drwx------ 2 ldm mcidas 512 May 12 22:10 302/
drwx------ 2 ldm mcidas 512 May 12 21:50 202/
drwx------ 2 ldm mcidas 512 Apr 25 06:50 301/
drwx------ 2 ldm mcidas 512 Apr 23 14:50 571600/
drwx------ 2 ldm mcidas 512 Apr 23 14:50 5801/
drwx------ 2 ldm mcidas 512 Apr 23 14:50 778602/
drwx------ 2 ldm mcidas 512 Apr 4 14:26 570300/
drwx------ 2 ldm mcidas 512 Apr 4 14:26 777502/
drwx------ 2 ldm mcidas 512 Mar 19 13:20 4501/
drwx------ 2 ldm mcidas 512 Feb 18 12:37 1090511/
drwx------ 2 ldm mcidas 512 Feb 16 07:37 100716/
drwx------ 2 ldm mcidas 512 Apr 19 2001 683109/
drwx------ 2 ldm mcidas 512 Feb 19 2001 43502/
drwx------ 2 ldm mcidas 512 Feb 19 2001 84901/
The old ones in the list (those from before May 16) are the result of
aborted processes. Cleaning them up is a _good thing_ :-)
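If you want to automate the cleanup, something like this (a sketch
using the portable one-level 'find' idiom; run it as 'ldm', perhaps
from cron) removes .mctmp subdirectories untouched for over a day:

find /usr/local/ldm/.mctmp/. ! -name . -prune -type d -mtime +1 -exec /bin/rm -rf {} \;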
After making sure that there were no shared memory segments still
allocated to 'ldm' (again, McIDAS use), I restarted the LDM:
ldmadmin start
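For the record, the shared memory check before the restart was along
these lines ('ipcs'/'ipcrm' are the standard SysV tools; the segment
id is whatever 'ipcs' reports):

ipcs -m | grep ldm        # list shared memory segments owned by 'ldm'
ipcrm -m <shmid>          # remove any stale segment by its id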
Things appear to be running smoothly, and the load on windfall is low:
last pid: 27157; load averages: 0.26, 0.35, 0.91 15:04:54
66 processes: 65 sleeping, 1 on cpu
CPU states: 76.2% idle, 18.0% user, 3.8% kernel, 2.0% iowait, 0.0% swap
Memory: 384M real, 5920K free, 58M swap in use, 1266M swap free
>thanks in advance, Tom or Anne or whomever....
Tom here...
>(by the way, this isn't any really time-sensitive
>issue, no operational or quasi-operational work
>going on here)
No problem. This was a quick fix.
Talk to you later...
Tom
>From address@hidden Thu May 16 18:51:27 2002
Thanks so much Tom!
My instinct was to just kill the processes, so I don't
know why I didn't, just confusion I guess.
Jennie