Re: the saga continues
- Subject: Re: the saga continues
- Date: Tue, 20 Aug 2002 12:39:30 -0600 (MDT)
On Mon, 19 Aug 2002, Benjamin Cotton wrote:
> Anne,
>
> I made the changes to ldmd.conf, ldmadmin and pqact.conf that you
> suggested. I have logs now, but still no satellite data. The rest of
> my data is either incomplete and/or a day late. The interesting thing
> is that the incoming data is only about 20 seconds late. So something
> is getting lost in the shuffle. I knew I shoulda got luggage tags.
> Haha anyway... well at least we're getting somewhere.
>
> Ben
>
> P.S. My horoscope for today read in part: "The world is your oyster." I
> just can't escape those oysters...
>
> ===================
> Benjamin J. Cotton
> LDM Administrator
> Department of Earth and Atmospheric Science,
> Purdue University
>
> 165 Cary Quadrangle cell: (502) 551-5403
> West Lafayette, IN 47906 campus: (765) 49-52298
>
> address@hidden
> www.eas.purdue.edu/~bcotton
>
>
Hi Ben,
I'm glad that the logging is working now. Did you see these messages in
the log?
Aug 20 16:10:59 flood[16519]: run_requester: 20020820154739.805 TS_ENDT {{NNEXRAD|DIFAX|UNIDATA, ".*"}}
Aug 20 16:10:59 flood[16519]: FEEDME(flood.atmos.uiuc.edu): OK
Aug 20 16:11:00 flood[16519]: pq_del_oldest: conflict on 40654520
Aug 20 16:11:00 flood[16519]: hereis: pq_insert failed: Resource temporarily unavailable: 68934acfb1d8e490a644914e27bbe686 8539 20020820154900.986 NNEXRAD 030 SDUS53 KMQT 201546 /pN0RMQT
Aug 20 16:11:00 flood[16519]: Connection reset by peer
Aug 20 16:11:00 flood[16519]: Disconnect
Your connection to flood is continually being broken and reestablished,
presumably because anvil's disk is unavailable. This can be caused by
a full disk, but your disk isn't full:
(anvil.eas.purdue.edu) [/project/ldm]% cd data
(anvil.eas.purdue.edu) [/project/ldm/data]% df -k .
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/ad0s1g 20202730 15325911 3260601 82% /net/anvil
So, back to this problem in a moment.
I see that you are requesting lots of data from flood:
request DIFAX|UNIDATA|NNEXRAD ".*" flood.atmos.uiuc.edu
Do you really want or need the entire NEXRAD feed? I see that you're
filing only the N0R products. If you don't need the entire feed for
relay purposes, I strongly suggest requesting only the N0R products
from flood, since they are a small percentage of the entire feed.
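For example, something like this in ldmd.conf (just a sketch - the "/pN0R"
pattern is my guess based on the product IDs in your log, e.g. "SDUS53 KMQT
201546 /pN0RMQT", so double-check it against the products you actually file):
request DIFAX|UNIDATA ".*"    flood.atmos.uiuc.edu
request NNEXRAD       "/pN0R" flood.atmos.uiuc.edu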
I see that there are lots of 'find' processes running on anvil:
(anvil.eas.purdue.edu) [/project/ldm/etc]% ps -ax | grep find
875 ?? D 48:19.36 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
1795 ?? D 27:29.13 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
6277 ?? D 69:01.29 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
8663 ?? D 242:41.01 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
10368 ?? D 83:40.45 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
10826 ?? D 199:54.24 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
12099 ?? DN 61:37.30 find -s / ! ( -fstype ufs ) -prune -or -path
/tmp -prune -or -path /usr/tmp -prune -or -path
12588 ?? D 144:34.92 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
13471 ?? D 261:46.83 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
13691 ?? D 105:58.58 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
17146 ?? D 117:23.19 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
23366 ?? DN 184:42.46 find -s / ! ( -fstype ufs ) -prune -or -path
/tmp -prune -or -path /usr/tmp -prune -or -path
92142 ?? D 8:20.79 find /net/anvil -xdev -type f ( -perm -u+x -or
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
These processes are owned by root, and I don't know how they are being
generated. Are they for security reasons? Are they necessary? Some have
been running for many hours. On a filesystem as full as /net/anvil they
can be *very* disk intensive, and could contribute to the disk being
unavailable.
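If you decide to kill them off, something along these lines should work
(just a sketch - eyeball the PID list first, and since the processes are
owned by root you'll need to run it as root):
ps -ax | grep 'find /net/anvil' | grep -v grep | awk '{print $1}' | xargs kill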
Until I noticed that the find processes were owned by root I thought
perhaps the find was coming from the LDM scour program, so I
checked the LDM crontab to see how scour was being invoked. I found this:
(anvil.eas.purdue.edu) [/project/ldm/etc]% crontab -l | grep scour
35 * * * * /project/ldm/bin/scour_anvil > /dev/null
I looked at scour_anvil - it's running your own scour program, so I have
no idea what that's doing. Could that be invoking the 'find' processes
above??
Given the delays you're experiencing with the filed data, I wondered if
pqact was keeping up. So, I put it in verbose mode, and grabbed this out
of the log:
Aug 20 17:57:21 pqact[16518]: 149 20020820140009.143 HDS 121 NXUS65 KPSR 201359 /pGSMIWA
Aug 20 17:57:21 pqact[16518]: 7332 20020820140009.152 NNEXRAD 122 SDUS25 KABQ 201357 /pN1SABX
Aug 20 17:57:21 pqact[16518]: 6158 20020820140009.154 NNEXRAD 123 SDUS23 KGRB 201352 /pN2RGRB
Aug 20 17:57:21 pqact[16518]: 18639 20020820140009.171 NNEXRAD 124 SDUS54 KFWD 201351 /pNCRGRK
Aug 20 17:57:21 pqact[16518]: 1625 20020820140009.311 HDS 130 SDUS83 KMKX 201351 /pDPAMKX
Aug 20 17:57:21 pqact[16518]: 5429 20020820140009.186 NNEXRAD 125 SDUS33 KMKX 201351 /pN3RMKX
Aug 20 17:57:21 pqact[16518]: 8470 20020820140009.188 NNEXRAD 126 SDUS24 KAMA 201358 /pN1RAMA
Aug 20 17:57:21 pqact[16518]: 5685 20020820140009.203 NNEXRAD 127 SDUS75 KABQ 201357 /pN1VABX
Aug 20 17:57:21 pqact[16518]: 9936 20020820140009.205 NNEXRAD 128 SDUS56 KSGX 201357 /pN0RNKX
This shows that pqact is running several hours behind - it's not able
to keep up with the volume of data. Perhaps killing the 'find' processes
will free up some of the disk for pqact.
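(For reference: to knock pqact's logging back down once you're done looking,
sending it a USR2 signal adjusts the LDM logging level - in the LDM versions
I'm familiar with it cycles through normal, verbose, and debug. Using the
pqact PID from the log above:
kill -USR2 16518
Repeat until it's back at the level you want.)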
I also think that your 300MB queue is too small for what you are
requesting. If you look at the data volumes in your workshop
binder, you'll see that on that day the NEXRAD max was 150MB. (Over the
past 24 hours the NEXRAD max was 206MB.) Let's just look at the math
based on the volumes in the workshop notebook:
NEXRAD       150 MB
HDS          141 MB
IDS|DDPLUS   5.5 MB
DIFAX          7 MB
MCIDAS         7 MB
That adds up to over 310 MB per hour, so you would be unable to keep even
an hour's worth of data in a 300MB queue.
However, pqmon is reporting that you have several hours' worth of data in
your queue (see the "age" column, the age in seconds of the oldest product
in the queue):
(anvil.eas.purdue.edu) [/project/ldm/data]% pqmon -i2
Aug 20 17:15:47 pqmon: Starting Up (29021)
Aug 20 17:15:47 pqmon: nprods nfree nempty    nbytes maxprods maxfree minempty maxext   age
Aug 20 17:15:47 pqmon:  40458     1  32783 299996608    56208       7    17033   6720 10350
Aug 20 17:15:49 pqmon:  40458     1  32783 299996608    56208       7    17033   6720 10352
Aug 20 17:15:51 pqmon:  40458     1  32783 299996608    56208       7    17033   6720 10354
I assume this is because writes to the queue aren't succeeding, based on
the 'resource temporarily unavailable' message.
So, in summary, I recommend killing all the 'find' processes and figuring
out where they are coming from and whether they're necessary. I also
recommend cutting back on your NEXRAD request or, if you must have it all,
using a bigger queue of at least 500MB. If those steps don't solve the
problem, I would stop filing so much data and see whether the results are
better. Then you can gradually add filing back in until you find the
threshold where things start falling apart.
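If you do rebuild the queue at a larger size, the procedure is roughly
this (a sketch - exactly where the queue size is set depends on your LDM
version, so treat the 500MB value as the setting you'll need to track down
in ldmadmin or its configuration):
ldmadmin stop
# bump the queue size setting to at least 500MB here
ldmadmin delqueue
ldmadmin mkqueue
ldmadmin start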
This is a little scattered. Please let me know if you have any questions.
Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************