[netCDF #RQB-854711]: MPI/IO with unlimited dimensions
- Subject: [netCDF #RQB-854711]: MPI/IO with unlimited dimensions
- Date: Wed, 11 Sep 2013 15:55:53 -0600
Hi Sebastian,
> I tried to compile with netCDF 4.3.1-rc2, but now my program
> crashes because of an MPI error:
>
> *** An error occurred in MPI_Allreduce: the reduction operation MPI_MAX
> is not defined on the MPI_BYTE datatype
> *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
> *** MPI_ERR_OP: invalid reduce operation
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> I'm using OpenMPI 1.4.3.
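For background: in MPI, the MPI_MAX reduction is defined only for integer and
floating-point datatypes, while MPI_BYTE admits only the bitwise reductions
(MPI_BAND, MPI_BOR, MPI_BXOR), which is why a conforming MPI rejects this
combination with MPI_ERR_OP. A minimal sketch of the distinction (this is an
illustration, not the netCDF or HDF5 internals):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        unsigned char local = 1, global = 0;

        /* Invalid: MPI_MAX is not defined on MPI_BYTE, so a conforming
         * MPI fails with MPI_ERR_OP:
         *
         *   MPI_Allreduce(&local, &global, 1, MPI_BYTE, MPI_MAX,
         *                 MPI_COMM_WORLD);
         */

        /* Legal: an integer datatype with MPI_MAX ... */
        int ilocal = local, imax = 0;
        MPI_Allreduce(&ilocal, &imax, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

        /* ... or MPI_BYTE with a bitwise reduction such as MPI_BOR. */
        MPI_Allreduce(&local, &global, 1, MPI_BYTE, MPI_BOR, MPI_COMM_WORLD);

        printf("max=%d bor=%u\n", imax, (unsigned)global);
        MPI_Finalize();
        return 0;
    }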
I'm assuming the program that crashes is the test.cpp you attached in
your original support question. I tried to duplicate the problem using
OpenMPI 1.7.2_1 on an OSX platform, and got a different error:
$ mpicxx test.cpp -o test -I${NCDIR}/include -I${H5DIR}/include -L${NCDIR}/lib \
  -L${H5DIR}/lib -lnetcdf -lhdf5_hl -lhdf5 -ldl -lm -lz -lcurl
$ ./test
Start on rank 0: 0 0
Count on rank 0: 1 0
Assertion failed: (size), function H5MM_calloc, file ../../src/H5MM.c, line 95.
[mort:71677] *** Process received signal ***
[mort:71677] Signal: Abort trap: 6 (6)
[mort:71677] Signal code:  (0)
[mort:71677] [ 0] 2   libsystem_c.dylib   0x00007fff939b994a _sigtramp + 26
[mort:71677] [ 1] 3   ???                 0x0000000000000000 0x0 + 0
[mort:71677] [ 2] 4   libsystem_c.dylib   0x00007fff93a11e2a __assert_rtn + 146
[mort:71677] [ 3] 5   test                0x0000000108eeea10 H5MM_calloc + 256
[mort:71677] [ 4] 6   test                0x0000000108d4ca3e H5D__chunk_io_init + 1534
[mort:71677] [ 5] 7   test                0x0000000108d8a45c H5D__write + 4028
[mort:71677] [ 6] 8   test                0x0000000108d87460 H5D__pre_write + 3552
[mort:71677] [ 7] 9   test                0x0000000108d8658c H5Dwrite + 732
[mort:71677] [ 8] 10  test                0x0000000108c8ac27 nc4_put_vara + 3991
[mort:71677] [ 9] 11  test                0x0000000108ca0564 nc4_put_vara_tc + 164
[mort:71677] [10] 12  test                0x0000000108ca04ab NC4_put_vara + 75
[mort:71677] [11] 13  test                0x0000000108c08240 NC_put_vara + 288
[mort:71677] [12] 14  test                0x0000000108c092d4 nc_put_vara_int + 100
[mort:71677] [13] 15  test                0x0000000108bf2e56 main + 630
[mort:71677] [14] 16  libdyld.dylib       0x00007fff886fd7e1 start + 0
[mort:71677] [15] 17  ???                 0x0000000000000001 0x0 + 1
[mort:71677] *** End of error message ***
Abort
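Since the attached test.cpp isn't reproduced in this archive, here is a
minimal sketch of the kind of program under discussion: a collective parallel
write to a netCDF-4 variable with an unlimited dimension, going through
nc_put_vara_int as in the stack trace above. The file, dimension, and
variable names and the per-rank decomposition are illustrative, not
Sebastian's exact code:

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(s)); \
        MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimids[2], varid;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        CHECK(nc_create_par("test.nc", NC_NETCDF4 | NC_MPIIO,
                            MPI_COMM_WORLD, MPI_INFO_NULL, &ncid));
        CHECK(nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]));
        CHECK(nc_def_dim(ncid, "x", nprocs, &dimids[1]));
        CHECK(nc_def_var(ncid, "v", NC_INT, 2, dimids, &varid));
        CHECK(nc_enddef(ncid));

        /* Collective access: every rank must take part in each write. */
        CHECK(nc_var_par_access(ncid, varid, NC_COLLECTIVE));

        size_t start[2] = {0, (size_t)rank};
        size_t count[2] = {1, 1};   /* one record; per the report, varying
                                       "count" per rank triggered the hang */
        int value = rank;
        printf("Start on rank %d: %zu %zu\n", rank, start[0], start[1]);
        printf("Count on rank %d: %zu %zu\n", rank, count[0], count[1]);
        CHECK(nc_put_vara_int(ncid, varid, start, count, &value));

        CHECK(nc_close(ncid));
        MPI_Finalize();
        return 0;
    }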
> I think the bug was introduced by this pull request:
> https://github.com/Unidata/netcdf-c/pull/4
We're looking into the problem; thanks for reporting it.
--Russ
> Best regards,
> Sebastian
>
> On 22.08.2013 18:28, Unidata netCDF Support wrote:
> > Hi Sebastian,
> >
> >> my problem sounds similar to the bug, but it is different. My program
> >> also hangs when using collective MPI I/O.
> >>
> >> According to the bug report, only an issue with independent I/O was fixed.
> >
> > You're right, but we think we have a fix for the collective I/O hang now,
> > available in the netCDF-C 4.3.1-rc2 version (a release candidate):
> >
> > https://github.com/Unidata/netcdf-c/releases/tag/v4.3.1-rc2
> >
> > At your convenience, please let us know if it fixes the problem.
> >
> > --Russ
> >
> >> On 06.08.2013 00:09, Unidata netCDF Support wrote:
> >>> Hi Sebastian,
> >>>
> >>> Could you tell us if this recently fixed bug sounds like what you
> >>> found?
> >>>
> >>> https://bugtracking.unidata.ucar.edu/browse/NCF-250
> >>>
> >>> If so, the fix will be in netCDF release 4.3.1, a release candidate
> >>> for which will soon be announced.
> >>>
> >>> --Russ
> >>>
> >>>> Hi everybody,
> >>>>
> >>>> I just figured out that using collective MPI/IO on variables with
> >>>> unlimited dimensions can lead to deadlocks or incorrect output files.
> >>>>
> >>>> I have attached a small example program which reproduces the deadlock
> >>>> (and, depending on the variable "count", incorrect output files).
> >>>>
> >>>> Did I do anything wrong or is this a known bug?
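A general note on this pattern (a hedged sketch, independent of the attached
program): with NC_COLLECTIVE access, every rank in the communicator has to
take part in each nc_put_vara_* call, because extending the unlimited
dimension is itself a collective operation. A rank with nothing to write can
participate with a zero-sized count instead of skipping the call; skipping
the call is one classic way to produce exactly this kind of hang. The helper
below is hypothetical; ncid and varid are assumed to come from a parallel
create/define as sketched earlier:

    #include <netcdf.h>
    #include <stddef.h>

    /* Hypothetical helper: every rank calls this collectively; ranks
     * with have_data == 0 contribute a zero-sized hyperslab. */
    static int write_record(int ncid, int varid, int rank, int have_data,
                            const int *data)
    {
        size_t start[2] = {0, (size_t)rank};
        size_t count[2] = {1, 1};
        if (!have_data)
            count[0] = count[1] = 0;   /* participate, but write nothing */
        return nc_put_vara_int(ncid, varid, start, count, data);
    }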
> >>>>
> >>>> My configuration:
> >>>> HDF5 1.8.11
> >>>> netCDF 4.3
> >>>> OpenMPI (default Ubuntu installation)
> >>>>
> >>>> Compile command:
> >>>> mpicxx test.cpp -I/usr/local/include -L/usr/local/lib -lnetcdf -lhdf5_hl
> >>>> -lhdf5 -lz
> >>>> (netcdf and hdf5 are installed in /usr/local)
> >>>>
> >>>> Best regards,
> >>>> Sebastian
> >>>>
> >>>> --
> >>>> Sebastian Rettenberger, M.Sc.
> >>>> Technische Universität München
> >>>> Department of Informatics
> >>>> Chair of Scientific Computing
> >>>> Boltzmannstrasse 3, 85748 Garching, Germany
> >>>> http://www5.in.tum.de/
> >>>>
> >>>>
> >>> Russ Rew UCAR Unidata Program
> >>> address@hidden http://www.unidata.ucar.edu
> >>
> >> --
> >> Sebastian Rettenberger, M.Sc.
> >> Technische Universität München
> >> Department of Informatics
> >> Chair of Scientific Computing
> >> Boltzmannstrasse 3, 85748 Garching, Germany
> >> http://www5.in.tum.de/
> >>
> >>
> >>
> > Russ Rew UCAR Unidata Program
> > address@hidden http://www.unidata.ucar.edu
>
> --
> Sebastian Rettenberger, M.Sc.
> Technische Universität München
> Department of Informatics
> Chair of Scientific Computing
> Boltzmannstrasse 3, 85748 Garching, Germany
> http://www5.in.tum.de/
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: RQB-854711
Department: Support netCDF
Priority: Normal
Status: Closed