20020520: motherlode:/home disk situation
- Subject: 20020520: motherlode:/home disk situation
- Date: Tue, 21 May 2002 01:32:49 -0600
>From: "Mike Schmidt" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200205202252.g4KMqUi1660870 McIDAS-XCD XCD_START.LOG
Mike,
>The /home partition on motherlode is still full and a little
>investigation reveals the following;
>
> - total disk usage of /home according to "du -sk" is about 4GB
> - actual disk usage from "df -kl" shows 15GB used
> - "lsof | grep home" shows that several mcidas processes have what
> appears to be an 11GB file open with inode 800062 that must have
> had its directory entry removed, yet the file is still open
> - inode 800062 is likely to have resided in /home/mcidas/upcworkdata
> given that inodes 800061 and 800063 reside there;
>
>(motherlode) 193 # ls -laiR | grep 80006
> 800060 -rw-rw-r-- 1 ldm ustaff 448 May 20 16:40 DCLSTIDX.PTR
> 800064 -rw-rw-r-- 1 mcidas ustaff 4020 May 20 16:39 ALLOC.WWW
> 800069 -rw-rw-r-- 1 ldm ustaff 22127 May 20 16:35 ROUTEPP.LOG
> 800068 -rw-r--r-- 1 mcidas ustaff 8901480 May 20 10:30 GRID9999
> 800063 -rw-rw-r-- 1 ldm ustaff 530 May 19 21:00 scour.log
> 800067 -rw-rw-r-- 1 mcidas ustaff 20820 Mar 2 16:53 PUBLIC.SRV
> 800061 -rw-rw-r-- 1 mcidas ustaff 292940 Feb 9 10:31 GROUPS.DAT
> 800065 -rw-rw-r-- 1 mcidas ustaff 13628 Jul 5 2000 RESOLV.BAK
> 800066 -rw-rw-r-- 1 mcidas ustaff 622 Jun 14 2000 SETUP.BAT
>
>do you have any thoughts on this issue? Presumably, stopping all the
>mcidas processes will free 11GB.
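[The du/df mismatch described above is the classic signature of a file
that was unlinked while a process still held it open. A minimal
reproduction with throwaway paths — note the "(deleted)" marker is
Linux /proc behavior; Solaris shows the fds differently:]

```shell
#!/bin/sh
# Sketch (throwaway paths): a file unlinked while a process still
# holds it open keeps its disk blocks allocated, so "du" (which walks
# directory entries) undercounts relative to "df".
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1024 count=512 2>/dev/null
tail -f "$tmp" >/dev/null 2>&1 &   # background reader holds an fd open
pid=$!
sleep 1
rm "$tmp"                          # directory entry gone; blocks are not
# On Linux the orphaned descriptor shows up as "(deleted)" in /proc:
deleted_count=$(ls -l /proc/$pid/fd 2>/dev/null | grep -c deleted)
echo "open-but-deleted fds held by $pid: $deleted_count"
kill $pid 2>/dev/null
```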
I just saw your note. I logged onto motherlode and did a 'df -k' that
yielded the following results:
/local/ldm% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d0 124330 71585 40312 64% /
/dev/md/dsk/d6 1018382 480215 477065 51% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/md/dsk/d3 493688 368986 75334 84% /var
swap 2919408 16 2919392 1% /var/run
/dev/md/dsk/d1 17114522 15399537 1543840 91% /scratch
/dev/md/dsk/d9 17114522 15762532 1180845 94% /home
/dev/md/dsk/d8 17372586 4185845 13013016 25% /opt
/dev/md/dsk/d7 608262842 536361370 65818844 90% /data
/dev/md/dsk/d4 1018382 385138 572142 41% /tmp
Note that /home is at 94% capacity.
I then stopped the LDM and reran 'df -k':
/local/ldm% ldmadmin stop
stopping the LDM server...
LDM server stopped
/local/ldm% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d0 124330 71585 40312 64% /
/dev/md/dsk/d6 1018382 480215 477065 51% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/md/dsk/d3 493688 368988 75332 84% /var
swap 3501760 16 3501744 1% /var/run
/dev/md/dsk/d1 17114522 15399537 1543840 91% /scratch
/dev/md/dsk/d9 17114522 12626044 4317333 75% /home
/dev/md/dsk/d8 17372586 4185665 13013196 25% /opt
/dev/md/dsk/d7 608262842 536366334 65813880 90% /data
/dev/md/dsk/d4 1018382 389035 568245 41% /tmp
Note that /home is now at 75% capacity.
Shutting down the LDM also stopped all of the McIDAS-XCD processes.
I cleaned out ~ldm/.mctmp (entries are created by McIDAS processes
running from the LDM; the subdirectories under ~ldm/.mctmp are small);
restarted the LDM; and then reran 'df -k' with the following results:
/local/ldm% ldmadmin start
starting the LDM server...
the LDM server has been started
/local/ldm% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d0 124330 71585 40312 64% /
/dev/md/dsk/d6 1018382 480215 477065 51% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/md/dsk/d3 493688 369030 75290 84% /var
swap 3560624 16 3560608 1% /var/run
/dev/md/dsk/d1 17114522 15967865 975512 95% /scratch
/dev/md/dsk/d9 17114522 4155606 12787771 25% /home
/dev/md/dsk/d8 17372586 4185488 13013373 25% /opt
/dev/md/dsk/d7 608262842 536366262 65813952 90% /data
/dev/md/dsk/d4 1018382 383522 573758 41% /tmp
I guess the question for me now is exactly what process was holding
open the huge file in /home/mcidas/upcworkdata after it had been deleted!?
The only thing that comes to mind is the file ROUTEPP.LOG. This file
contains log output from McIDAS PostProcess BATCH files which are
run upon receipt of image files in the MCIDAS stream. ROUTEPP.LOG
gets deleted by the cron-initiated mcscour.sh process run by 'ldm'.
The puzzling thing is that the PP BATCH processes fire up and log
to ROUTEPP.LOG and then exit. They shouldn't hold the file open for
writing. Again, I am puzzled by this...
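[One way to sidestep the problem — a sketch, not something mcscour.sh
does today: have the scour step truncate ROUTEPP.LOG in place rather
than delete it, since truncating the inode frees the blocks at once
even while another process still holds the file open. The file name
below is a throwaway stand-in:]

```shell
#!/bin/sh
# Sketch (throwaway file name standing in for ROUTEPP.LOG): truncating
# a log in place frees its blocks immediately, even while another
# process still holds the file open; rm'ing it would leave the blocks
# pinned until that process exits.
log=$(mktemp)
dd if=/dev/zero of="$log" bs=1024 count=256 2>/dev/null
tail -f "$log" >/dev/null 2>&1 &   # stand-in for a long-lived holder
pid=$!
sleep 1
: > "$log"                         # truncate in place instead of rm
size=$(wc -c < "$log")
echo "size after truncate: $size"
kill $pid 2>/dev/null
rm -f "$log"
```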
Let's chat about what could be going on tomorrow.
Tom
>>From address@hidden Mon May 20 17:05:36 2002
>
>More info -- I forgot about /proc. Look in /proc/8442/fd which is the
>list of open fd's for one of the mcidas pid's and see this;
>
>--w--w---- 0 ldm ustaff 11879671265 May 20 16:17 1
>--w--w---- 0 ldm ustaff 11879671265 May 20 16:17 2
>
>as I recall, fd 1 and 2 are stdout and stderr. You can cat the file;
>
>(motherlode) 234 # more 1
>Starting DDS at 02125.205929
>Domestic Data Service
>205929
>High Resolution Data Service
>ingebin.k: Done
> at 02125.205930
>startxcd.k: m0monexe - first startup of :DMSFC
> 9350
>startxcd.k: m0monexe - first startup of :DMRAOB
> 9351
>startxcd.k: m0monexe - first startup of :DMSYN
> 9352
>startxcd.k: m0monexe - first startup of :DMMISC
> 9353
>startxcd.k: m0monexe - first startup of :DMGRID
> 9354
>...
This is the log output for the starting of XCD processes. This logging
is supposed to be written to upcworkdata/XCD_START.LOG. This file
does not get deleted until the LDM is stopped and restarted.
Each time an XCD process starts, log output gets written to XCD_START.LOG.
Admittedly, XCD processes do start/restart on a regular basis, but
I would never have guessed that there was enough log output to consume
11 GB of disk space over the number of days that the LDM stays up
and running.
Also, I just took a look at /proc:
/home/mcidas/upcworkdata% cd /proc
/proc% dir */fd/*
I see LOTS of entries that are huge:
...
640300 -rw-rw-r-- 1 ustaff 89991520 May 21 01:24 27253/fd/14
...
640300 -rw-rw-r-- 1 ustaff 89991520 May 21 01:24 27254/fd/13
...
640300 -rw-rw-r-- 1 ustaff 89991520 May 21 01:24 27258/fd/10
...
640300 -rw-rw-r-- 1 ustaff 89991520 May 21 01:24 27259/fd/11
...
640300 -rw-rw-r-- 1 ustaff 89991520 May 21 01:24 27264/fd/18
...
7463818 -rw-rw-r-- 1 ustaff 100237236 May 21 01:23 27260/fd/13
...
17728129 -rw-rw-r-- 1 ustaff 378889216 May 21 01:23 27559/fd/3
...
6637 -r--r--r-- 1 ustaff 1980915712 May 17 11:54 27217/fd/4
6637 -r--r--r-- 1 ustaff 1980915712 May 17 11:54 27218/fd/0
6637 -r--r--r-- 1 ustaff 1980915712 May 17 11:54 27219/fd/0
6637 -r--r--r-- 1 ustaff 1980915712 May 17 11:54 27220/fd/0
6637 -r--r--r-- 1 ustaff 1980915712 May 17 11:54 27222/fd/4
6637 -rw-rw-r-- 1 ustaff 1980915712 May 17 11:54 27334/fd/4
What do these mean?
Could this be the same thing we saw on shemp (when stopping the LDM
resulted in 600 MB magically appearing)?
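[For what it's worth: on Linux, each /proc/<pid>/fd/<n> entry is a
symlink that readlink can resolve back to a pathname; on Solaris the
equivalent fd-to-path mapping comes from "pfiles <pid>". A small
sketch with a throwaway file:]

```shell
#!/bin/sh
# Sketch: resolve a process's open descriptors back to pathnames via
# the Linux /proc symlinks (Solaris would use "pfiles <pid>" instead).
tmp=$(mktemp)
tail -f "$tmp" >/dev/null 2>&1 &   # holder process for the demo
pid=$!
sleep 1
target=""
for fd in /proc/$pid/fd/*; do
    link=$(readlink "$fd" 2>/dev/null)
    [ "$link" = "$tmp" ] && target=$link
done
echo "a descriptor of $pid resolves to: $target"
kill $pid 2>/dev/null
rm -f "$tmp"
```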
Tom