Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 (fwd)
- Subject: Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1 (fwd)
- Date: Fri, 15 Feb 2002 12:18:24 -0700
This email has been forwarded to the netCDF support email archive for
archiving.
------- Forwarded Message
Return-Path: address@hidden
Delivery-Date: Thu Feb 14 16:49:43 2002
Organization: Arctic Region Supercomputing Center
Keywords: 200202122006.g1CK6Lx24308
Date: Thu, 14 Feb 2002 14:49:41 -0900
From: John Metzner <address@hidden>
To: Steve Emmerson <address@hidden>
cc: address@hidden
Subject: Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1
In-Reply-To: <address@hidden>
Message-ID: <address@hidden>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Steve,
The macros.make diff does not show anything significant, just the
$SRCDIR and $prefix differences I would have expected.
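For anyone repeating this comparison, here is a minimal sketch of the check using Python's difflib rather than the "diff" utility itself. The function name and the sample file contents are hypothetical; only the idea of comparing the two macros.make files comes from this thread.

```python
import difflib


def changed_lines(good_text, bad_text):
    """Return only the removed/added lines between two macros.make texts."""
    diff = list(difflib.unified_diff(
        good_text.splitlines(),
        bad_text.splitlines(),
        fromfile="good/macros.make",   # labels only, not real paths
        tofile="bad/macros.make",
        lineterm="",
        n=0,                           # no context lines
    ))
    # Skip the two file-header lines, then keep "-" and "+" lines
    # and drop the "@@" hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical contents, for illustration only.
good = "prefix = /corp/netcdf\nCFLAGS = -O3\n"
bad = "prefix = /site/netcdf\nCFLAGS = -O3\n"
for line in changed_lines(good, bad):
    print(line)
```

If only lines such as prefix or SRCDIR show up, the build settings themselves match, as they did here.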
I see a number of *.o file size differences between the "good" and
locally built files in src/nctest. The "good" build:
chilkoot$ ls -l *.o
-rw------- 1 jlm cray 6848 Feb 13 19:18 add.o
-rw------- 1 jlm cray 79448 Feb 13 19:18 atttests.o
-rw------- 1 jlm cray 35640 Feb 13 19:18 cdftests.o
-rw------- 1 jlm cray 21544 Feb 13 19:18 dimtests.o
-rw------- 1 jlm cray 3904 Feb 13 19:18 driver.o
-rw------- 1 jlm cray 1768 Feb 13 19:18 emalloc.o
-rw------- 1 jlm cray 1536 Feb 13 19:18 error.o
-rw------- 1 jlm cray 2608 Feb 13 19:18 misctest.o
-rw------- 1 jlm cray 27480 Feb 13 19:18 rec.o
-rw------- 1 jlm cray 13552 Feb 13 19:18 slabs.o
-rw------- 1 jlm cray 7912 Feb 13 19:18 val.o
-rw------- 1 jlm cray 13368 Feb 13 19:18 vardef.o
-rw------- 1 jlm cray 5840 Feb 13 19:18 varget.o
-rw------- 1 jlm cray 6200 Feb 13 19:18 vargetg.o
-rw------- 1 jlm cray 6040 Feb 13 19:18 varput.o
-rw------- 1 jlm cray 6312 Feb 13 19:18 varputg.o
-rw------- 1 jlm cray 30280 Feb 13 19:18 vartests.o
-rw------- 1 jlm cray 5104 Feb 13 19:18 vputget.o
-rw------- 1 jlm cray 7376 Feb 13 19:18 vputgetg.o
The locally built "bad" build (differences flagged with !!!):
chilkoot$ ls -l *.o
-rw------- 1 jlm software 7000 Feb 13 20:29 add.o !!!
-rw------- 1 jlm software 79448 Feb 13 20:29 atttests.o
-rw------- 1 jlm software 35640 Feb 13 20:29 cdftests.o
-rw------- 1 jlm software 21544 Feb 13 20:29 dimtests.o
-rw------- 1 jlm software 3904 Feb 13 20:29 driver.o
-rw------- 1 jlm software 1768 Feb 13 20:29 emalloc.o
-rw------- 1 jlm software 1536 Feb 13 20:29 error.o
-rw------- 1 jlm software 2608 Feb 13 20:29 misctest.o
-rw------- 1 jlm software 27480 Feb 13 20:29 rec.o
-rw------- 1 jlm software 13576 Feb 13 20:29 slabs.o !!!
-rw------- 1 jlm software 7560 Feb 13 20:29 val.o !!!
-rw------- 1 jlm software 13368 Feb 13 20:29 vardef.o
-rw------- 1 jlm software 5840 Feb 13 20:29 varget.o
-rw------- 1 jlm software 6200 Feb 13 20:29 vargetg.o
-rw------- 1 jlm software 6040 Feb 13 20:29 varput.o
-rw------- 1 jlm software 6312 Feb 13 20:29 varputg.o
-rw------- 1 jlm software 30280 Feb 13 20:29 vartests.o
-rw------- 1 jlm software 5160 Feb 13 20:29 vputget.o !!!
-rw------- 1 jlm software 7520 Feb 13 20:29 vputgetg.o !!!
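The side-by-side comparison of the two listings can be automated. What follows is a minimal sketch, assuming the "good" and "bad" object files sit in two separate directories; the function name and directory arguments are hypothetical. A size difference only hints that the generated code differs; it does not show which difference matters.

```python
from pathlib import Path


def size_differences(good_dir, bad_dir, pattern="*.o"):
    """Map each file name present in both directories to (good_size, bad_size)
    when the two sizes differ."""
    good = {p.name: p.stat().st_size for p in Path(good_dir).glob(pattern)}
    bad = {p.name: p.stat().st_size for p in Path(bad_dir).glob(pattern)}
    return {
        name: (good[name], bad[name])
        for name in sorted(good.keys() & bad.keys())
        if good[name] != bad[name]
    }
```

Called on the two src/nctest directories, this would return only the entries flagged with !!! above, for example add.o mapped to (6848, 7000).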
The src/nctest/nctest binaries are different sizes, of course. The
"good":
-rwx------ 1 jlm cray 1966216 Feb 14 12:28 nctest
The local "bad":
-rwx------ 1 jlm software 1965960 Feb 13 20:29 nctest
Also, I found a core file in the bad directory from the 'make test'
run. Thought it might mean something to you.
chilkoot$ debugview core
CrayTools DebugView 3.0.0.35 (Cray version) Mar 12 2001 14:24:46
------------------------------------------------------------------
No symbols are available for debugging because the executable has
been stripped or is not accessible. Source-level debugging is not
available, and in some cases, TotalView may fail when allocating
memory for the assembly-code listing. If you are debugging a core
file, running totalview specifying only the core file may help.
------------------------------------------------------------------
***** START OF SYMBOLIC DUMP *****
LIST OF PROCESS STATES
PIDs 8610: Signal SIGFPE <Floating point exception>
DISPLAYING PIDs 8610: Signal SIGFPE <Floating point exception>
Signal SIGFPE in routine ncx_putn_float_float at address 0p113671d
ncx_putn_float_float was called by putNCv_float at line 1913 (address 0p134147d)
putNCv_float was called by nc_put_vara_float at line 5675 (address 0p177240d)
nc_put_vara_float was called by nc_put_varm at line 11048 (address 0p251461a)
nc_put_varm was called by ncvarputg at line 624 (address 0p275263c)
ncvarputg was called by test_varputgetg at line 119 (address 0p12067b)
test_varputgetg was called by $STKOFEN at line 52 (address 0p545453b)
$STKOFEN was called by test_ncvarputg at line 52 (address 0p3273b)
test_ncvarputg was called by main at line 66 (address 0p12644d)
main was called by $START$ at line 350 (address 0p1121c)
***** END OF SYMBOLIC DUMP *****
Any thoughts on where to go next to get a good 'make test' run? I'm
thinking of building a 'chroot' environment where I can guarantee I've
eliminated any /usr/local/lib libraries without affecting the real users on
the system. I can make any changes I want within it to isolate the cause of
the failed 'make test'.
Thanks for all your time and quick responses. It is much appreciated.
Regards,
John Metzner - Cray, Inc address@hidden
Arctic Region Supercomputing Center address@hidden
910 Yukon Drive Rm. 106E Phone: (907)474-5431
Fairbanks, AK 99775-6020 FAX: (907)474-1820
On Thu, 14 Feb 2002, Steve Emmerson wrote:
> Date: Thu, 14 Feb 2002 15:07:03 -0700
> From: Steve Emmerson <address@hidden>
> To: John Metzner <address@hidden>
> Cc: address@hidden
> Subject: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1
>
> John,
>
> >Date: Thu, 14 Feb 2002 12:48:03 -0900
> >From: John Metzner <address@hidden>
> >Organization: Arctic Region Supercomputing Center
> >To: Steve Emmerson <address@hidden>
> >Subject: Re: 20020212: netcdf 3.5.0 ncvarput failure - Cray SV1
> >Keywords: 200202122006.g1CK6Lx24308
>
> The above message contained the following:
>
> > I'm still working on trying to get netCDF 3.5.0 built and tested on
> > our Cray SV1ex. I tried turning down the optimization level as you
> > suggested, to no avail: same error during 'make test'. This was done
> > after a 'make distclean', making sure there was no config.cache and
> > resetting the environment variables. There is one (that I know of)
> > local change to the default library search path which causes
> > /usr/local/lib to be prepended to the library search path (even
> > preempting -L on the command line), which I pulled out. I ran through
> > the full build & test sequence again and got the same error as below.
> > I did pull the netCDF-3.5.0 package inside Cray Corporate, built and
> > tested the package there on a SV1ex. It worked, so the problem is some
> > local system change which is getting in the way.
> > I pulled the package from Cray Corporate back out to the site with
> > the "good" libraries and build products. I reran the 'make test' on it,
> > again without error.
> > Next I copied the locally built libsrc/libnetcdf.a and
> > cxx/libnetcdf_c++.a into the proper location for the "good" package
> > from Cray Corporate. A 'make test' ran again without error. I was
> > trying to determine if the problem was in the test code or the
> > libraries built locally. Is that a valid test?
>
> If your locally-built libnetcdf.a library, when copied into the Cray
> Corporate package, results in that package correctly executing a "make
> test", then it would seem that the problem lies in the building and/or
> execution of the netCDF-2 test program rather than with the netCDF
> library functions.
>
> A good way to look at the differences in the build environments is to
> use the "diff" utility on the file "macros.make", which is located in
> the top-level source directory. Does it show anything significant?
>
> Another thing to check is whether or not the files in the netCDF-2 test
> directory, "nctest", are the same.
>
> Regards,
> Steve Emmerson <http://www.unidata.ucar.edu>
>
From address@hidden Fri Feb 15 12:09:40 2002
Subject: Re: 20020214: netcdf 3.5.0 ncvarput failure - Cray SV1
Steve,
I did a bit more testing with the nctest code, comparing builds
between here at ARSC and inside Cray Corporate. I was able to get the
nctest code to build and run successfully here when I changed the CFLAGS
entry in the macros.make file from "-O3" to "-h inline3,scalar3,task1,vector0".
Also "-O0" would work, but not "-O1" (-O1 is equivalent to -h inline1,scalar1,
task1,vector1).
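To see which suboption separates the passing flag set from the failing one, the two "-h" strings quoted above can be compared directly. This is a minimal sketch; the helper name is hypothetical, and the expansion of "-O1" is taken from this message, not from compiler documentation.

```python
import re


def parse_h_options(flags):
    """Split a '-h name<level>,name<level>,...' string into {name: level}."""
    body = flags.split(None, 1)[1]          # drop the leading "-h"
    options = {}
    for item in body.replace(" ", "").split(","):
        match = re.fullmatch(r"([a-z]+)(\d+)", item)
        if match:
            options[match.group(1)] = int(match.group(2))
    return options


passing = parse_h_options("-h inline3,scalar3,task1,vector0")
failing = parse_h_options("-h inline1,scalar1,task1,vector1")  # "-O1", per this message

# Suboptions set lower in the passing build than in the failing one.
lowered = sorted(
    name for name in passing
    if name in failing and passing[name] < failing[name]
)
print(lowered)  # ['vector']
```

Only the vector level is lower in the passing build, which makes vectorization of the nctest code a plausible suspect. The thread itself does not confirm that.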
I found that the versions of the C/C++ compilers were slightly different
between here and the Cray Corporate machine. We are running version 3.5.0.1
and the corporate system was 3.5.0.3. When I changed to the same 3.5.0.1
compiler on the corporate machine, I got the same failure. The problem was
still there when I switched to 3.5.0.2 on the corporate system. So Cray
made some change to the compiler at 3.5.0.3 which lets nctest run without
the floating point exception.
You might want to enter this into your problem/fix database in case
some other poor Cray soul gets bitten by it. Thanks for all your help and
quick responses. It's great to get this kind of support on an open source
package, pretty rare too.
Thanks again,
John Metzner - Cray, Inc address@hidden
Arctic Region Supercomputing Center address@hidden
910 Yukon Drive Rm. 106E Phone: (907)474-5431
Fairbanks, AK 99775-6020 FAX: (907)474-1820