19990309: Route Post Process Failure
- Subject: 19990309: Route Post Process Failure
- Date: Tue, 09 Mar 1999 17:51:02 -0700
>From: "Jennie L. Moody" <address@hidden>
>Organization: UVa
>Keywords: 199903091926.MAA12825 McIDAS ROUTE SYSIMAGE.SAV
Jennie,
>Hi. I am having a bad day. We stopped getting McIDAS data yesterday
>(after I encouraged my class to pay attention to the satellite imagery
>as this storm developed, natch). Anyway, after trying to diagnose what
>might be wrong I finally found the problem was at our host's end.
You know that you could have used ADDE to look at imagery on other
people's machines ;-)
>o Jeff Wolfe fixed his system, and data started flowing.
Good.
>I am working at home, but I could see that new Unidata-McIDAS feed
>files were sitting in the /incoming/data/mcidasd directory, so I
>thought things were going to be okay.
>
>o the mcidas.log file showed products were coming in and getting
> logged
There is a strange message in ~ldma/logs/mcidas.log, however:
Mar 09 23:32:38 lwtoa3[27414]: Starting Up
Mar 09 23:32:38 lwtoa3[27414]: unsetting MCPATH environment variable
Mar 09 23:32:38 lwtoa3[27414]: changing to directory /incoming/data/mcidasd
Mar 09 23:32:38 lwtoa3[27414]: decoding "LWTOA3 163 DIALPROD=UA 99068 233222"
Mar 09 23:32:38 lwtoa3[27414]: PRODUCT CODE=UA 99068 233222
Mar 09 23:32:40 lwtoa3[27414]: Done -- AREA= 166
Mar 09 23:32:40 lwtoa3[27414]: Exiting
The line that is unsettling is "unsetting MCPATH environment variable".
I don't know if this is something to worry about, or simply a reflection
of your having verbose logging turned on.
>o a run of route.k LIST showed me that the routing table was being
> updated.
So far, so good.
>However, I found there were core files being dumped by the ldma user
>(who runs the PPBatch files).
You don't say where.
>I went to look at the ROUTEPP.LOG file
>and it didn't exist?
Not good.
>I thought maybe having the McIDAS data just start
>flowing again when my ldm had been running might have caused some
>problem, so I stopped and restarted the ldm. (The clueless turn it off
>and turn it back on response, which I don't like).
Actually, this is not a bad strategy for the clued.
>Since Tom told me
>if you are restarting the ldm right away, there is no need to rebuild
>the queue,
Actually, my comment really had no time reference to it. The only time
that you need to remake the queue is when it gets corrupted. The reason
to _not_ remake the queue if you don't have to is that when you do remake
it, you then implicitly are requesting the past hour's worth of data from
your upstream feed site.
>I didn't (though I had remade the queue earlier today, when
>I was switching hosts, etc. And my system had been receiving xcd
>products as I noted earlier.)
OK.
>Anyway:
>
>o I am still getting core dumps
Where? Which program(s) are dumping the core files? To find out,
type:
file core
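A minimal sketch of how that check might be scripted when core files could be sitting in several places. The function name `report_cores` and the directories in the usage note are illustrative assumptions, not part of any LDM or McIDAS tooling; it relies only on the standard `find` and `file` utilities.

```shell
#!/bin/sh
# report_cores DIR...: hypothetical helper, not part of LDM or McIDAS.
# Finds every regular file named "core" under the given directories and
# runs `file` on it; the `file` output normally names the program that
# dumped the core.
report_cores() {
    for dir in "$@"; do
        # Quietly skip arguments that are not directories.
        [ -d "$dir" ] || continue
        find "$dir" -type f -name core -print | while IFS= read -r corefile; do
            if command -v file >/dev/null 2>&1; then
                file "$corefile"
            else
                # Fall back to just the path if `file` is not installed.
                echo "$corefile"
            fi
        done
    done
}
```

For example, `report_cores /incoming/data /home/ldma /home/mcidas/workdata` would cover the directories mentioned in this note; adjust the list to wherever core files have been turning up.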
>o the McIDAS products are coming in, but we are not
> "post processing" them.
OK.
>o there is no ROUTEPP.LOG file to look at, so I am stumped
OK.
>Any help would be appreciated. I would like to have stuff
>back up for class discussion tomorrow.
I am on your machine as I write this.
>------------
>In a separate item, the only response to my needdata request so far has
>been from someone who archived gempak grids. Is it insane to think I
>might be able to get these files from gempak back into GRIB or into
>some other form that McIDAS could read (of course I cannot do this here
>since I don't run GEMPAK...
It may not be insane, but it is somewhat demented :-)
>would like to, but I have my hands full
>just trying to keep on top of McIDAS/LDM along with all the things I am
>supposed to do).
I understand.
>-----------
>
>BTW: if it's Tom reading this message,
I am.
>just so you don't think we have
>anything really messed up on our system (like our clock) after my
>question yesterday, Owen and I are the ones who were messed up, not the
>system. Wasn't there some adage about the user always being wrong?...
Don wrote the note on my white board: 99% of all errors are user induced!!!
>Anyway, I
>am happy to hear what I/we have done wrong currently to cause our post
>processing to fail...it had been working up until we stopped receiving
>data from PSU, now that we have data flowing, we have no
>post-processing.
OK.
>If you are getting frustrated by my questions, you can perhaps imagine
>how frustrated I am about continually asking for assistance!
I'm not getting frustrated by the questions. After all, it is my job.
>(By the way, even diagnosing things is a hassle working from home,
Hmm... I now have a connection from Sugarloaf into the NCAR/UCAR RAS
(Remote Access Service) through a 28.8 modem. I worked from home for
the first time yesterday afternoon/evening and found the experience
to be very enjoyable.
>cause my line is 30 minutes, timed, and I keep losing my connection.
So it disconnects after 30 minutes whether or not you are using the
line? This sucks!
>If there is any reason a phone call would help, I am at 804-977-0910.
>Thanks.
>
>>From address@hidden Tue Mar 9 14:52:11 1999
>
>o I looked around and found that core dumped a bunch of files in
> /incoming/data that shouldn't be there (ALLOC.WWW, FRAME.001,
> TERMCHAR.001, SYSIMAGE.SAV) so I deleted them (recalling this was a
> problem in the past when we had ppbatch failure)
I found the same files there also, so I deleted them. The weird part
is that ALLOC.WWW never gets created there. It is almost as though
MCPATH for the ROUTE PP BATCH session had the /incoming/data/mcidasd
directory as its _first_ directory. If this is the case, then all
bets are off in terms of REDIRECTions, etc.
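One way to test that suspicion is to look at which directory leads MCPATH inside the session in question. A small sketch, assuming MCPATH is the usual colon-separated list of directories; the function names `first_mcpath_dir` and `warn_if_first` are made up for illustration:

```shell
#!/bin/sh
# first_mcpath_dir PATHLIST: hypothetical helper; prints the first entry
# of a colon-separated directory list such as MCPATH.
first_mcpath_dir() {
    # ${1%%:*} removes everything from the first colon onward.
    printf '%s\n' "${1%%:*}"
}

# warn_if_first PATHLIST DIR: hypothetical helper; prints a warning and
# returns 1 when DIR is the leading directory of PATHLIST, else returns 0.
warn_if_first() {
    if [ "$(first_mcpath_dir "$1")" = "$2" ]; then
        echo "WARNING: $2 is the first directory in the search path"
        return 1
    fi
    return 0
}
```

Calling `warn_if_first "$MCPATH" /incoming/data/mcidasd` from within the environment that the post-process batch script sets up would show whether the ingest directory has ended up at the front of the list.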
>o I stopped the ldm, then deleted these files, then found one of them
> (FRAMENH.001) in workdata, and mv'd it to a new name (I kept it cause
> I was trying to figure out from its time stamp what the hell was
> going on)
I saw that one also.
>o I restarted the ldm
>
>o first mcidas file that came in after restarting was an MDR, triggering
> MDR.BAT
>
>o the batch failed, dumped core and all these nasty files again, and
> still didn't write anything to ROUTEPP.LOG, which clearly isn't
> getting made.
OK.
>Just wanted to let whoever might log in (please do!) know that since my
>last message, I got in and tried to fix things up...but I don't have
>notes here from the last time we had pp-batch problems, and I cannot
>recall everything that was a problem.
>
What I did was:
o logon as 'mcidas'
o cd to workdata
o run DMAP:
dmap.k \*.001
The listing I got showed that these files were in directories other
than /incoming/data/mcidasd:
windfall: /p0/users/mcidas/workdata % dmap.k \*.001
PERM      SIZE LAST CHANGED FILENAME          DIRECTORY
---- --------- ------------ ----------------- ---------
-r--      6528 Mar 09 16:09 FRAMENH.001       /home/mcidas/help
-rw-     58752 Mar 09 12:43 FRAMENH.001.keep? /home/mcidas/workdata
-r--     45056 Mar 09 16:09 TERMCHAR.001      /home/mcidas/help
110336 bytes in 3 files
Not good!
>I surrender, take away my Ph.D. I do recall that everything gets hosed
>when these temp files get in the PATH for the ldm (or user mcidas), and
>every process after will keep finding them first.
You remember correctly.
>But, I don't know
>why they are dumping here in the first place.
I don't either.
>I loathe this
>aggravation, I should have spent this time grading, writing, etc.
I understand your frustration.
>At
>least it's snowing, I'm going to go play with my kids in the snow and
>hope that someone out there can help.
Well... And I am on your machine sniffing out bugs. Where's the justice
in that ;-)
So, while I was sitting here, I experienced another instance of SYSIMAGE.SAV
being created in the /home/mcidas/workdata directory. I suspect that this
is either due to bad settings in /home/ldma/util/batch.k or something
much worse. I really don't suspect anything bad in /home/ldma/util/batch.k
since it has been working right along. Something worse would require that
the machine be rebooted.
Further probing reveals that your 'ldma' user can't create files in
the /home/mcidas/workdata directory:
windfall: /home/ldma/util $ touch /home/mcidas/data/ROUTEPP.LOG
touch: /home/mcidas/data/ROUTEPP.LOG cannot create
Not being able to create files in that directory would account for the
log file not existing. What other ripple effects it might have, I don't
know off the top of my head.
Strangely enough, the permissions on /home/mcidas/workdata looked like
'ldma' should have been able to write there:
drwxrwxr-x 3 mcidas mcidas 5632 Mar 9 18:50 workdata/
This shows read/write/execute for owner and group and read/execute for
world. I quickly checked to make sure that 'ldma' and 'mcidas' are
still in the same group:
windfall: /p0/users/mcidas % id mcidas
uid=101(mcidas) gid=101(mcidas)
windfall: /p0/users/mcidas % id ldma
uid=100(ldma) gid=101(mcidas)
Since they are in the same group, the directory permissions above should
have been sufficient. In order to try and get things working, I changed
the permissions to 777:
windfall: /p0/users/mcidas % chmod 777 workdata
I could then create /home/mcidas/workdata/ROUTEPP.LOG as 'ldma'.
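The same kind of check can be repeated for any directory the post-process scripts write to. A minimal sketch, assuming it is run while logged in as the user being tested (here 'ldma'); the function name `can_create_in` and the probe file name are hypothetical:

```shell
#!/bin/sh
# can_create_in DIR: hypothetical helper; returns 0 if the current user
# can create (and then remove) a scratch file in DIR, 1 otherwise.
can_create_in() {
    probe="$1/.write_probe.$$"
    # The subshell keeps the shell's own "cannot create" message quiet.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}
```

For example, `can_create_in /home/mcidas/workdata || echo "cannot write to workdata"` reports a problem without leaving a stray file behind.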
As I continued to browse around looking for things that might be wrong,
I decided that it would be interesting to see how long this machine has
been up:
windfall: /home/mcidas/workdata % uptime
7:27pm up 61 day(s), 8:51, 4 users, load average: 0.61, 0.21, 0.13
It may well be that some part of the operating system has gone south
and the machine needs rebooting!
Two things that I see, which shouldn't have any ill effects, are:
ls -l /home/ldma/SYSKEY.TAB
-rw-r--r-- 1 ldma mcidas 24000 May 4 1998 /home/ldma/SYSKEY.TAB
ls -l /home/ldma/etc/ROUTE.SYS
-rw-r--r-- 1 ldma mcidas 7168 May 4 1998 /home/ldma/etc/ROUTE.SYS
Why are these files there?
As a last ditch effort (shy of rebooting), I did the following as 'ldma':
o ldmadmin stop
o windfall: /incoming/data/mcidasd $ ipcs
IPC status from <running system> as of Tue Mar 9 19:18:19 1999
T ID KEY MODE OWNER GROUP
Message Queues:
Shared Memory:
m 200 00000000 --rw------- ldma mcidas
Semaphores:
windfall: /incoming/data/mcidasd $ cd /home/ldma/.mctmp
windfall: /home/ldma/.mctmp $ ls -al
total 12
drwx------ 5 ldma mcidas 512 Mar 8 22:37 ./
drwxrwxr-x 13 ldma mcidas 2048 Mar 9 19:16 ../
drwx------ 2 ldma mcidas 512 Feb 10 18:10 101/
drwx------ 2 ldma mcidas 512 Feb 10 18:10 102/
drwx------ 2 ldma mcidas 512 Feb 2 16:09 200/
windfall: /home/ldma/.mctmp $ ps -eaf | grep mcenv
jlm8h 89 1 0 Mar 08 pts/11 0:01 mcenv -k 73 -f 11@700x864 -f
5@480x640 mctext -iw -c !@exec mcimage -igeometry
ldma 27853 27301 0 19:19:02 pts/23 0:00 grep mcenv
windfall: /home/ldma/.mctmp $ ipcrm -m 200
windfall: /home/ldma/.mctmp $ ipcs
IPC status from <running system> as of Tue Mar 9 19:19:15 1999
T ID KEY MODE OWNER GROUP
Message Queues:
Shared Memory:
Semaphores:
windfall: /home/ldma/.mctmp $ ls
./ ../ 101/ 102/ 200/
windfall: /home/ldma/.mctmp $ /bin/rm -rf *
windfall: /home/ldma/.mctmp $ ls
./ ../
o windfall: /home/ldma $ ldmadmin start
starting the LDM server...
the LDM server has been started
windfall: /home/ldma $ ls /incoming/data/mcidasd
...
<looking for *.001, SYSIMAGE.SAV, core, etc. found none, so I moved on>
As the user 'mcidas':
o cd /home/mcidas/workdata
o route.k LIST
S Pd Description               Range     Last Received           Post Process C
- -- ------------------------- --------- ------------ ---------- ------------ -
  CI GOES-8/9 IR Composite     80-89     AREA0086     99069   35 none         3
  CV GOES-8/9 VIS Composite    90-99     AREA0092     99069   37 none         3
  CW GOES-8/9 H2O COMPOSITE    70-79     AREA0071     99069   38 H2OCOMP.BAT  3
  LD NLDN Lightning Flashes    71-71     none         none       none         3
  MA Surface MD data           default   MDXX0009     99069   30 SFC.BAT      3
  N1 GOES-8 IR/TOPO Composite  220-229   AREA0220     99069   35 none         3
  N2 GOES-8 VIS/TOPO Composite 230-239   AREA0232     99069   37 none         3
  N3 GOES-9 IR/TOPO Composite  240-249   AREA0243     99042 1718 none         3
  N4 GOES-9 VIS/TOPO Composite 250-259   AREA0252     99042 1626 none         3
  N5 MDR/TOPO Composite        260-269   AREA0267     99067 1006 none         3
  N6 Mollweide IR/TOPO Composi 270-279   AREA0275     99067 1031 none         3
  N7 GOES-8/9 IR/TOPO Composit 280-289   AREA0282     99069   35 none         3
  N8 GOES-8/9 VIS/TOPO Composi 290-299   AREA0296     99069   37 none         3
  NF Global Initialization Gri 101-106   GRID0101     99068 2237 GLOBAL.BAT   3
  NG Early Domestic Products   1-40      GRID0039     99068 1641 ADDGRID.BAT  3
  R1 Base Reflectivity Tilt 1  300-339   none         none       none         3
  R2 Base Reflectivity Tilt 2  340-379   none         none       none         3
  R3 Base Reflectivity Tilt 3  380-419   none         none       none         3
  R4 Base Reflectivity Tilt 4  420-459   none         none       none         3
  R5 Composite Reflectivity    460-499   none         none       none         3
  R6 Layer Reflect SFC-24 K ft 500-539   none         none       none         3
  R7 Layer Reflect 24-33 K ft  540-579   none         none       none         3
  R8 Layer Reflect 33-60 K ft  580-619   none         none       none         3
  R9 Echo Tops                 620-659   none         none       none         3
  RA Vertical Liquid H2O       660-699   none         none       none         3
  RB 1-hour Surface Rain Total 700-739   none         none       none         3
  RC 3-hour Surface Rain Total 740-779   none         none       none         3
  RD Storm Total Rainfall      780-819   none         none       none         3
  RE Radial Velocity Tilt 1    820-859   none         none       none         3
  RF Radial Velocity Tilt 2    860-899   none         none       none         3
  RG Radial Velocity Tilt 3    900-939   none         none       none         3
  RH Radial Velocity Tilt 4    940-979   none         none       none         3
  RI 248 nm Base Reflectivity  980-1019  none         none       none         3
  RJ Storm-Rel Mean Vel Tilt 1 1020-1059 none         none       none         3
  RK Storm-Rel Mean Vel Tilt 2 1060-1099 none         none       none         3
  RM Mandatory Upper Air MD da default   MDXX0017     99067  232 MAN.BAT      3
  RS Significant Upper Air MD  default   MDXX0027     99067  233 SIG.BAT      3
  U1 Antarctic IR Composite    190-199   AREA0195     99069   11 none         3
  U2 FSL2 hourly wind profiler default   none         none       none         3
  U3 Manually Digitized Radar  200-209   AREA0201     99069    6 MDR.BAT      3
s U4 Unidata-Wisconsin hourly  default   none         none       PROFILER.BAT 3
  U5 GOES-9 Western US IR      130-139   AREA0130     99069   27 IR9.BAT      3
  U6 FSL2 6-minute Wind profil default   none         none       none         3
  U9 GOES-9 Western US VIS     120-129   AREA0122     99069   27 VIS9.BAT     3
  UA Educational Floater I     160-169   AREA0167     99069   31 none         3
  UB GOES-9 Western US H2O     170-179   AREA0172     99069   27 H2O9.BAT     3
  UC Educational Floater II    60-69     AREA0062     99069   33 none         3
  UI GOES-8 North America IR   150-159   AREA0155     99069   35 IR8.BAT      3
  UM Administrative Message    default   none         none       none         1
  UR Research Floater          180-189   none         none       none         3
  US Undecoded SAO Data        default   UNIDATAS     99069   31 none         1
  UV GOES-8 North America VIS  140-149   AREA0146     99069   37 VIS8.BAT     3
  UW GOES-8 North America H2O  210-219   AREA0216     99069   36 H2O8.BAT     3
  UX Mollweide Composite IR    100-109   AREA0100     99068 2231 MOLL.BAT     3
  UY Mollweide Composite H2O   110-119   AREA0113     99068 2236 none         3
At the time of the listing, the DAY was 99069, and the time was about
40 past the hour. This listing shows that the ROUTE PostProcessing is
once again working. Furthermore, there are no core, *.001, SYSIMAGE.SAV,
etc. files in /incoming/data/mcidasd. My guess, therefore, was that the
problem was caused by whatever processes had allocated the shared memory
segment with id '200'. My removing this segment while the LDM was off
and my removing the directories under /home/ldma/.mctmp apparently cleared
up the problem.
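If this comes up again, the leftover segments can be picked out of the `ipcs` listing before deciding what to remove. A rough sketch, assuming `ipcs` prints the column layout shown earlier in this note (T ID KEY MODE OWNER GROUP); other systems lay the columns out differently, and the function name `shm_ids_for_owner` is hypothetical:

```shell
#!/bin/sh
# shm_ids_for_owner OWNER: hypothetical helper; reads `ipcs` output in the
# "T ID KEY MODE OWNER GROUP" layout on stdin and prints the ID of every
# shared memory segment (type "m") owned by OWNER.
shm_ids_for_owner() {
    awk -v owner="$1" '$1 == "m" && $5 == owner { print $2 }'
}
```

With the LDM stopped and no McIDAS sessions running as that user, `ipcs | shm_ids_for_owner ldma` lists the candidate IDs. Each one should be reviewed before it is handed to `ipcrm -m`, since a segment that belongs to a live session must not be removed.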
I'll have to think about this one to figure out if the problem was caused
by switching the LDM feed (!?).
Later...
Tom