[netCDF #TIR-820282]: NetCDF-4 Parallel independent access with unlimited dimension (Fortran 90)
- Date: Fri, 26 Apr 2013 15:34:01 -0600
Reto,
I've finally created a Jira ticket for this issue, in case you want to follow
its status:
https://bugtracking.unidata.ucar.edu/browse/NCF-250
--Russ
> Russ,
>
> So, I've now also recompiled the whole NetCDF/HDF5 suite with MPICH 3.0.3
> instead of OpenMPI. Same story.
>
> I've traced the blocking statement to the HDF5 library, which is called from
> the netCDF library during nc_put_vara_int:
>
> In nc4hdf.c (around line 770) it is calling the H5D.c routine H5Dset_extent:
>     if (H5Dset_extent(var->hdf_datasetid, xtend_size) < 0)
>         BAIL(NC_EHDFERR);
>
> This is where the writing processes wait during an independent write
> operation involving one unlimited dimension (where the dataset extent needs
> to be extended) when not all processes take part in the write operation.
>
> Reto
>
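A toy model may help illustrate the diagnosis above. It is only a sketch: it does not call netCDF, HDF5, or MPI, and it assumes that extending the dataset behaves like a collective step that every rank must enter. A thread barrier stands in for that step, and the function name and the timeout are hypothetical, chosen for illustration.

```python
import threading


def simulate_extending_write(n_ranks, writers, timeout=0.5):
    """Toy model: every writing rank must pass a barrier sized for ALL ranks.

    The barrier stands in for a collective extent change; it is an
    assumption of this sketch, not the library's actual implementation.
    """
    barrier = threading.Barrier(n_ranks)
    outcome = {}
    lock = threading.Lock()

    def rank_main(rank):
        if rank not in writers:
            # This rank never enters the write call at all.
            result = "skipped"
        else:
            try:
                # Blocks until n_ranks threads have arrived, or times out.
                barrier.wait(timeout=timeout)
                result = "written"
            except threading.BrokenBarrierError:
                # Not every rank arrived; a real run would wait forever.
                result = "hung"
        with lock:
            outcome[rank] = result

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print(simulate_extending_write(4, {0, 1, 2, 3}))
    print(simulate_extending_write(4, {1, 2, 3}))
```

With all four ranks writing, every rank reports "written". With rank 0 excluded, ranks 1 to 3 report "hung" once the timeout expires, which mirrors the symptom described in this thread.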
>
> On Apr 12, 2013, at 7:32 PM, Unidata netCDF Support wrote:
>
> > Reto,
> >
> >> Yes, the POSIX parallel I/O tests fail on OSX with OpenMPI, but that is
> >> fine. OSX and OpenMPI use MPIIO. So, to my understanding, the parallel
> >> tests are OK if either POSIX or MPIIO works and the other one fails.
> >>
> >> I am actually not using a parallel file system on OSX. I use the regular
> >> file system (basic OSX installation) and I think that the parallel I/O has
> >> to work in collective and independent mode even when using a regular file
> >> system.
> >
> > I'm curious how you installed parallel HDF5, because my "make check" fails
> > before finishing the tests. Did you build HDF5 without --enable-parallel,
> > or without using CC=mpicc? Or did you build it with parallel I/O, but run
> > "make install" even though "make check" failed as a result of not having a
> > parallel file system?
> >
> > --Russ
> >
> >> I will test the same installation on Linux and then start debugging on
> >> OSX, and maybe we find out something.
> >>
> >> By the way, the netcdf-fortran 4.4 beta failed to compile altogether on
> >> OSX, so I'm still using netcdf-fortran 4.2.
> >>
> >> Have a great weekend,
> >>
> >> Reto
> >>
> >>
> >> On Apr 12, 2013, at 5:59 PM, Unidata netCDF Support wrote:
> >>
> >>> Reto,
> >>>
> >>>> I've tried the following configuration
> >>>> - hdf5 1.8.11-snap16
> >>>> - netcdf-4.3.0-rc4
> >>>> - netcdf-fortran-4.2
> >>>> - openmpi-1.6.3
> >>>> - gcc/gfortran 4.6.3
> >>>>
> >>>> Same issue. If I let all processes do the write, then it works fine. If
> >>>> I for instance exclude process #0,1,2 or 3 from the writing, then the
> >>>> write hangs (all metadata/open/close is collective, only the write is
> >>>> independent.). It seems to me that somehow on my system all writes are
> >>>> collective by default and thus the write operation is not executed as
> >>>> independent.
> >>>>
> >>>> Do you have a configuration with openmpi on OSX somewhere around?
> >>>
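The observation above, that the write succeeds when every process takes part, suggests one way to think about the problem: the length of the unlimited dimension is a single file-wide value, so all ranks have to agree on it. The sketch below models only that bookkeeping. It does not call netCDF or HDF5; the function name is hypothetical, and the convention that a rank with nothing to write submits a count of 0 is an assumption made for illustration.

```python
def plan_record_extent(current_len, requests):
    """Return (new_len, needs_extension) for the unlimited dimension.

    requests holds one (start, count) pair per rank along the unlimited
    dimension. A rank with no data passes count == 0 (an assumption of
    this sketch), so every rank appears in the list.
    """
    # A write of `count` records beginning at `start` needs start + count.
    needed = [start + count for start, count in requests if count > 0]
    new_len = max([current_len] + needed)
    return new_len, new_len > current_len


if __name__ == "__main__":
    # Rank 0 has nothing to write; ranks 1 to 3 each append one record.
    print(plan_record_extent(2, [(0, 0), (2, 1), (3, 1), (4, 1)]))
```

In this model the idle rank still contributes an entry, so all ranks arrive at the same new length. Whether a zero-count write is an acceptable workaround for the real library is not established in this thread.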
> >>> Yes, I had to deactivate my mpich configuration first, but now have
> >>> openmpi 1.6.4 on OSX 10.8.3. However, when I try to build
> >>> hdf5 1.8.11-pre1 with it, using
> >>>
> >>> CC=/opt/local/lib/openmpi/bin/mpicc ./configure
> >>> make
> >>> make check
> >>>
> >>> Some tests fail in "make check", for example testing
> >>> "ph5diff h5diff_basiccl.h5", which may be due to not having a
> >>> POSIX-compliant parallel file system installed. Also, I just noticed
> >>> that the earlier t_posix_compliant test for allwrite_allread_blocks
> >>> with POSIX IO failed, though it returned 0 so as not to stop the hdf5
> >>> testing.
> >>>
> >>>
> >>> Are you using a parallel file system? Do you set the environment variable
> >>> HDF5_PARAPREFIX to a directory in a parallel file system? What file
> >>> system are you using for your parallel I/O tests?
> >>>
> >>> I'm afraid I don't know much about parallel I/O, and the netCDF parallel
> >>> I/O expert got lured away to a different job some time ago, so we may
> >>> need some help or pointers on where to look to install a parallel file
> >>> system on our OS X platform for this kind of testing and debugging.
> >>>
> >>>> I will start putting some debugging commands into the netcdf-fortran
> >>>> library and see where the process really hangs and whether the
> >>>> collective/independent write is executed correctly.
> >>>
> >>> Thanks, that would be helpful ...
> >>>
> >>> --Russ
> >>>
> >>>> Reto
> >>>>
> >>>>
> >>>> On Apr 9, 2013, at 11:01 PM, Unidata netCDF Support wrote:
> >>>>
> >>>>> Hi Reto,
> >>>>>
> >>>>> Sorry to have taken so long to respond to your question.
> >>>>>> I have been using NetCDF-4 Parallel I/O with the Fortran 90 interface
> >>>>>> for some time with success. Thank you for this great tool!
> >>>>>>
> >>>>>> However, I now have an issue with independent access:
> >>>>>>
> >>>>>> - NetCDF F90 Parallel access (NetCDF-4, MPIIO)
> >>>>>> - 3 fixed and 1 unlimited dimension
> >>>>>> - all processes open/close the file and write metadata
> >>>>>> - only a few processes write to the file (-> independent access)
> >>>>>> - the write hangs. It works fine if all processes take part.
> >>>>>>
> >>>>>> I've changed your example F90 parallel I/O file simple_xy_par_wr.f90
> >>>>>> to include an unlimited dimension and independent access by only a
> >>>>>> subset of processes. Same issue, even if I explicitly set the access
> >>>>>> type to independent for the variable. Can you reproduce the issue on
> >>>>>> your side?
> >>>>>>
> >>>>>> The following system configuration on my side:
> >>>>>> - NetCDF 4.2.1.1 and F90 interface 4.2
> >>>>>> - hdf5 1.8.9
> >>>>>> - Openmpi 1.
> >>>>>> - OSX, gcc 4.6.3
> >>>>>
> >>>>> No, I haven't been able to reproduce the issue, but I can't exactly
> >>>>> duplicate your configuration easily, and there have been some updates
> >>>>> and bug fixes that may have made a difference.
> >>>>>
> >>>>> First I tried this configuration, which worked fine on your attached
> >>>>> example:
> >>>>>
> >>>>> - NetCDF 4.3.0-rc4 and F90 interface 4.2
> >>>>> - hdf5 1.8.11 (release candidate from svn repository)
> >>>>> - mpich2-1.3.1
> >>>>> - Linux Fedora, mpicc, mpif90 wrapping gcc, gfortran 4.5.1
> >>>>>
> >>>>> So if you can build those versions, it should work for you. I'm not
> >>>>> sure whether the fix is in netCDF-4.3.0 or in hdf5-1.8.11, but both
> >>>>> have a fix for at least one parallel I/O hanging process issue:
> >>>>>
> >>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-214 (fix in netCDF-4.3.0)
> >>>>> https://bugtracking.unidata.ucar.edu/browse/NCF-240 (fix in HDF5-1.8.11)
> >>>>>
> >>>>> --Russ
> >>>>>
> >>>>> Russ Rew UCAR Unidata Program
> >>>>> address@hidden http://www.unidata.ucar.edu
> >>>>>
> >>>>>
> >>>>>
> >>>>> Ticket Details
> >>>>> ===================
> >>>>> Ticket ID: TIR-820282
> >>>>> Department: Support netCDF
> >>>>> Priority: High
> >>>>> Status: Closed
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: TIR-820282
Department: Support netCDF
Priority: Emergency
Status: Closed