20041027: error writing to NFS netCDF file on Linux cluster
- Subject: 20041027: error writing to NFS netCDF file on Linux cluster
- Date: Wed, 27 Oct 2004 10:40:17 -0600
Dear Professor Constantinescu,
We might not be able to help you very much because the problem appears
to be due to the behavior of NFS on your Linux cluster rather than to
the netCDF library itself. As you observed:
-- Error only occurs while writing files to a directory of an NFS
filesystem (desired).
-- Error does not occur (works fine!) when writing to local /tmp.
(each process writes to its local /tmp). (not desired, since
result files are scattered across the cluster).
The primary person responsible for the netCDF package is attending a
conference at this time. He did, however, have the following to say:
... the problem described will be difficult to debug because it
appears to be dependent on an NFS problem with a Linux cluster
that we probably can't reproduce here. If he could supply a small
complete example that failed, we could try to duplicate the problem,
but if it depends on the details of the NFS implementation and
running on a cluster, that may be difficult.
Version 3.6.0-beta6 of netCDF is also now available, although I
don't recognize any bugs we fixed from 3.5.1 that would be relevant
to this problem.
Professor Constantinescu may not know about the parallel netCDF
package available from
http://www-unix.mcs.anl.gov/parallel-netcdf/
that may be a better solution to his problem. It would require
changes to his code, since the netCDF interface is a little
different, but it is based on MPI and has been successfully used in
several similar modeling projects. The pnetcdf developers may also
be more familiar with the symptoms he describes, since they have
debugged many problems with parallel netCDF I/O, MPI, and clusters.
There is a mailing list address@hidden for discussion
of their parallel netCDF software that might be able to help.
--Russ
Can you reduce the scope of the problem to a small example?
Is the parallel netCDF package a possible solution for you?
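If it helps in narrowing things down, a test along the following lines,
which does not use netCDF at all, might show whether the NFS mount itself
returns the error. It is only a sketch under assumptions: the function name
probe_write, the file naming, and the chunk size are made up for
illustration, the default sizes are taken from the figures in your report
(a 409,600-byte header and roughly 31 Megabytes per file), and the write
pattern is a guess that does not reproduce what the netCDF library actually
does internally.

```python
import errno
import os


def _write_all(fd, data):
    """Write every byte of data, looping because os.write may write less."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def probe_write(directory, rank, header_bytes=409_600,
                data_bytes=31 * 1024 * 1024):
    """Write one zero-filled file per process: a header block, then data.

    Hypothetical helper, not part of netCDF. Returns 0 on success or the
    errno of the first failing system call (5 is EIO, "Input/output
    error", on Linux).
    """
    path = os.path.join(directory, f"probe_{rank:03d}.bin")
    fd = -1
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        # Header-sized block first, flushed to the server before the data.
        _write_all(fd, bytes(header_bytes))
        os.fsync(fd)
        # Then the data, in 1 MiB chunks (an arbitrary choice).
        chunk = bytes(1024 * 1024)
        remaining = data_bytes
        while remaining > 0:
            n = min(remaining, len(chunk))
            _write_all(fd, chunk[:n])
            remaining -= n
        os.fsync(fd)
        # NFS can report deferred write errors at close time, so close
        # inside the try block to catch them as well.
        os.close(fd)
        fd = -1
    except OSError as exc:
        return exc.errno
    finally:
        if fd >= 0:
            try:
                os.close(fd)
            except OSError:
                pass
    return 0
```

Each of the 24 processes would call probe_write with the NFS directory and
its own rank, then print the return value together with
os.strerror(value). If a plain write like this also fails with error 5 on
NFS but succeeds in /tmp, that would point at the filesystem rather than at
netCDF; if it always succeeds, a small netCDF example would still be needed.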
Regards,
Steve Emmerson
--------Begin Original Message
From: Serban G Constantinescu <address@hidden>
To: address@hidden
Subject: e-mail about netcdf problems on a 32 bit PC cluster
I am contacting you about a SUPPORT REQUEST FORM that I filled out yesterday
about the problems which we have when we try to write large amounts of
data in netCDF using a massively parallel Fortran 90 code.
The email was submitted from the following website:
http://my.unidata.ucar.edu/content/support/email_support.php
Could you please confirm you received it?
Do you know about how much time it takes to get an answer?
Thank you for your help.
Best regards
George Constantinescu
Assistant Professor
Dept. Civil and Environmental Engineering
The University of Iowa
Package: netCDF Fortran (77 + 90)
Package version: 3.5.1
Operating System: Redhat Linux 2.4.9-e.49smp #1 SMP
Hardware Information: 64-node, 128-CPU, Linux-based computing cluster
running MPICH -1.2.5..12 from Myrinet, Sun Grid Engine 5.3, and Sun
Control Station 2.0. Compute nodes (64) are x86-based Sun Fire V60x
servers (see: http://www.sun.com/servers/entry/v60x/). Head nodes (2)
are x86-based Sun Fire V65x servers (see:
http://www.sun.com/servers/entry/v65x/). Compute nodes have two 36 GB
disk drives. Apple Storage Array for shared storage. SMC network for
transmitting data from the nodes to the Apple storage array (three SMC
TigerSwitch 10/100/1000 8624T 24-port switches). Myrinet switch for
internode communications.
Subject: nf_enddef() Input/output error
Description:
Hello,
We have a CFD Fortran MPI/netCDF parallel code which exhibits
"Input/output error" (Error 5) upon calling nf_enddef(). The code runs
with 24 MPI processes. At the end of computation, the resulting data is
written to disk via netCDF. Each MPI process writes to its own file;
there is no simultaneous access to any single file. Each file's size is
approximately 31 to 32 Megabytes when no error occurs. When the error
occurs, typically only the file's header is written, which is 409,600
bytes; occasionally a few megabytes of data are written. We don't have
a parallel file system, only NFS. MPI is MPICH -1.2.5..12/Myrinet.
Observations:
-- Error only occurs while writing files to a directory of an NFS
filesystem (desired).
-- Error does not occur (works fine!) when writing to local /tmp. (each
process writes to its local /tmp). (not desired, since
result files are scattered across the cluster).
-- We have 2 NFS filesystems we've tried: On one, about 23 out of 24
processes report the error (one error per process);
on the other, about 15 out of 24 processes report the error.
Could you advise us as to the cause of the error and how we might fix it?
The compiler and library versions are:
bash-2.05$ ifc -V
Intel(R) Fortran Compiler for 32-bit applications, Version 7.1 Build
20031225Z
Copyright (C) 1985-2003 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY
GNU ld version 2.11.90.0.8 (with BFD 2.11.90.0.8)
Supported emulations:
elf_i386
i386linux
elf_i386_glibc21
netcdf is version 3.5.1
mpich is version 1.2.5..12
--------End Original Message