20030318: ldm on waldo at stc; McIDAS-XCD scouring moved to 'ldm' account
- Date: Tue, 18 Mar 2003 12:08:48 -0700
>From: "Anderson, Alan C. " <address@hidden>
>Organization: St. Cloud State
>Keywords: 200303181640.h2IGeXB2004042 LDM-6 McIDAS-XCD scour
Alan,
>Noticed that our ldm has stopped getting data from papagayo
>as of about 10Z on 17 Mar. My log files seemed ok up to that
>time, then data log stopped. I have checked with Clint, see
>his response below.
>Any suggestions.
OK. The messages in Clint's log file confirm/demonstrate the inability
of his LDM to send you data.
>Have stopped and restarted my ldm this morning, but it is still
>not ingesting.
I logged on and was able to run notifyme to papagayo to verify that
nothing has changed on Clint's side (allows, etc.):
<as 'ldm'>
notifyme -vxl- -f ANY -o 3600 -h papagayo.unl.edu
Data lists came back immediately, proving that Clint's machine is
correctly set up to allow feeds from waldo.
I then ran top and noticed that the load average on waldo was 44.
Since this is extremely unusual, I decided to shut down the LDM
and run some checks on the queue.
/usr/local/ldm% pqcat -s -q data/ldm.pq -l-
Mar 18 18:28:36 pqcat: Starting Up (9152)
Mar 18 18:28:36 pqcat: assertion "IsAlloc(rep)" failed: file "pq.c", line 1907
Abort (core dumped)
This looked as though the queue was corrupted, so I decided to try to
delete and remake it:
/usr/local/ldm% ldmadmin delqueue
/usr/local/ldm/data/ldm.pq: No such file or directory
After verifying that there was still a link between /var/data/ldm and
/usr/local/ldm/data, I looked for a queue:
/usr/local/ldm% cd data
/usr/local/ldm/data% ls -alt
total 22
drwxr-xr-x 5 ldm data 512 Mar 18 18:28 ./
drwxr-xr-x 2 ldm data 6656 Mar 18 16:08 logs/
drwxrwxr-x 4 ldm data 512 Nov 6 21:01 gempak/
drwxrwxr-x 3 ldm data 512 Sep 25 01:00 surface/
drwxrwxr-x 4 ldm data 512 Nov 24 1999 ../
So, your problem was that your LDM queue somehow got deleted!
I remade the queue and then restarted your LDM:
/usr/local/ldm% ldmadmin mkqueue
/usr/local/ldm% ldmadmin start
Data is once again flowing into waldo. Now, the question is how the
LDM queue got deleted!?
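Since the queue vanished while the LDM kept running, a periodic check
for the queue file might catch a repeat sooner. The following is only
a minimal sketch: the function name check_queue is hypothetical, and
the default path is simply the one from the session above.

```shell
# Hypothetical helper: report whether the LDM product queue file exists.
# The default path is the one seen on waldo above; pass another path
# as the first argument if your queue lives elsewhere.
check_queue() {
    queue="${1:-/usr/local/ldm/data/ldm.pq}"
    if [ -f "$queue" ]; then
        echo "OK: $queue exists"
        return 0
    else
        echo "MISSING: $queue" >&2
        return 1
    fi
}
```

If it reports MISSING, the recovery would be the same as above: stop
the LDM, run 'ldmadmin mkqueue', and then 'ldmadmin start'.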
While I was on waldo, I decided to move the scouring of McIDAS-XCD
produced data files to the 'ldm' account:
<as 'ldm'>
cd util <- ~ldm/util is in the PATH for 'ldm'
cp ~mcidas/workdata/mcscour.sh .
<I looked at the contents of mcscour.sh to make sure that all the
environment variables are set correctly, and they are>
I changed the mcscour.sh logging from /home/mcidas/workdata/scour.log
to ~ldm/logs/mcscour.log. This puts almost all of your LDM related
log files into ~ldm/logs. The only one that I didn't move/change
was /home/mcidas/workdata/ROUTEPP.LOG. This can easily be moved
by editing the MCLOG setting in ~ldm/decoders/batch.k.
Next, I moved McIDAS ADDE server logging from ~mcidas/workdata to
~ldm/logs. This required that I:
o set up a McIDAS REDIRECTion for SERVER.* in the 'mcidas' account
o change the permissions on /var/data/ldm/logs so that it was
group writable (mcidas and mcadde are in the same group as ldm)
o move ~mcidas/workdata/SERVER.LOG to ~ldm/logs and change its
permission to be writable by mcadde
o add a cron entry to 'ldm's crontab to rotate the SERVER.LOG* files
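For illustration, the rotation step could be done along the following
lines. This is a sketch only and not necessarily the cron entry that
is installed on waldo: the function name rotate_log and the default of
7 retained copies are assumptions.

```shell
# Hypothetical rotation helper: LOG -> LOG.1 -> LOG.2 ... keeping
# at most $2 old copies (default 7, an assumed value).
rotate_log() {
    log="$1"
    keep="${2:-7}"
    i="$keep"
    # Shift older copies up by one, overwriting the oldest.
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$log.$prev" ]; then
            mv -f "$log.$prev" "$log.$i"
        fi
        i="$prev"
    done
    # Move the current log aside and start a fresh, empty one.
    if [ -f "$log" ]; then
        mv -f "$log" "$log.1"
    fi
    : > "$log"
    # Keep the new file group writable so that other accounts in the
    # same group (here mcadde) can still append to it.
    chmod g+w "$log"
}
```

Called as rotate_log ~ldm/logs/SERVER.LOG from a daily cron job, this
would keep a week of server logs.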
Then, since the dostats action is commented out in 'ldm's crontab
file, I edited ~ldm/etc/ldmd.conf to stop pqbinstats from running.
This prevents the .stats files from being created in ~ldm/logs.
This is necessary since the bin/ldmadmin dostats action normally
run from cron is what scours the .stats files; without it, the
files would accumulate indefinitely.
The last thing I did was run ~ldm/util/mcscour.sh "by hand" as 'ldm'
to make sure that it worked. It apparently does since the March 16
.XCD file in /var/data/mcidas and its associated .IDX files were
scoured off. This leaves that file system with about 3.5 GB of
free space:
% df -k
Filesystem         kbytes    used   avail capacity  Mounted on
/proc                   0       0       0     0%    /proc
/dev/dsk/c0d0s0   7396768 3681199 3641602    51%    /
fd                      0       0       0     0%    /dev/fd
swap               802576     312  802264     1%    /tmp
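To keep an eye on free space without reading the df listing by hand,
something like the following could be used. It is a rough sketch: the
function names avail_kb and enough_space are hypothetical, and it
assumes a df that accepts the POSIX -P option (on some systems that
may mean using a different df than the default one).

```shell
# Hypothetical helper: print the available kilobytes on the file
# system that holds the given path. -P asks for the POSIX one-line
# output format, so the fourth field is the "avail" column.
avail_kb() {
    df -kP "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed if at least $2 kilobytes are free on
# the file system that holds $1.
enough_space() {
    [ "$(avail_kb "$1")" -ge "$2" ]
}
```

For example, enough_space /var/data/mcidas 1000000 would fail once
less than about 1 GB remains, which could serve as an early warning
that the scouring has stopped working.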
Recap:
- the LDM was not receiving data since something had deleted the LDM
queue even though the LDM was still running. I remade the queue
and restarted the LDM. Data is being received and processed
normally once again.
- I moved the XCD scouring to an 'ldm' cron job and moved the log
file to ~ldm/logs/mcscour.log
- I moved the McIDAS ADDE remote server logging to ~ldm/logs and set up
a cron entry to rotate the log files
- I stopped pqbinstats from being run at LDM startup
We need to keep an eye on the McIDAS-XCD scouring done by mcscour.sh
to make sure that it continues to work.
Please let me know if you see anything amiss on waldo.
Tom
>-----Original Message-----
>From: Clint Rowe [mailto:address@hidden]
>Sent: Tuesday, March 18, 2003 10:33 AM
>To: Anderson, Alan C.
>Subject: Re: ldm at papagayo
>
>
>Alan,
>I seem to have all the data and papagayo's been chugging along without any
>problems. There are some errors regarding waldo in yesterday's log file:
>
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:168: HEREIS: RPC: Unable to
>send; errno = Broken pipe
>Mar 17 10:10:08 papagayo waldo(feed)[4767]: up6.c:369: Product send failure:
>I/O error
>Mar 17 10:10:16 papagayo rpc.ldmd[28230]: child 4767 exited with status 6
>
>...
>
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:168: HEREIS: RPC: Unable to
>send; errno = Broken pipe
>Mar 17 10:21:58 papagayo waldo(feed)[28849]: up6.c:369: Product send failure:
>I/O error
>Mar 17 10:22:06 papagayo rpc.ldmd[28230]: child 28849 exited with status 6
>
>...
>
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:168: HEREIS: RPC: Unable to
>send; errno = Broken pipe
>Mar 17 10:35:22 papagayo waldo(feed)[28847]: up6.c:369: Product send failure:
>I/O error
>Mar 17 10:35:30 papagayo rpc.ldmd[28230]: child 28847 exited with status 6
>
>I think the problem is at your end, as I'm getting data and nobody else has
>complained.
>
>Let me know if you can't get restarted.
>Clint
>
>
>>
>>Hi Clint
>>
>>We stopped getting data from papagayo yesterday, Mar. 17 at about 10Z
>>
>>Is there a problem at unl ?
>>
>>Alan Anderson
>>St. Cloud State