This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Looking more closely at the HDF5-1.8.10-patch1 release notes, I see I cut and pasted the
wrong entry; it was just the addition of a couple of new functions and is unlikely to fix
anything.  So, never mind ...

--Russ

> Hi Orion,
>
> I note that HDF5-1.8.10-patch1, the current and latest release, has fixes for
> parallel I/O problems, described in the RELEASE_NOTES as:
>
>     Parallel Library
>     ----------------
>     - Added the H5Pget_mpio_no_collective_cause() function that retrieves
>       reasons why the collective I/O was broken during read/write IO access.
>       (JKM - 2012/08/30 HDFFV-8143)
>
>     - Added H5Pget_mpio_actual_io_mode_f (MSB - 2012/09/27)
>
> Would it be practical/convenient for you to rerun your test using
> HDF5-1.8.10-patch1?  I can't tell from the above whether the latest fixes are
> relevant to the problem you're reporting, but it seems like a time-saving
> possibility.
>
> --Russ
>
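[Archive note, not part of the original exchange: below is a rough, hypothetical sketch of how
the H5Pget_mpio_no_collective_cause() diagnostic mentioned in the RELEASE_NOTES excerpt above
might be used after a collective H5Dwrite().  The file name, dataset shape, and data are
invented; it assumes a parallel HDF5 build (1.8.10 or later) compiled with an MPI wrapper
such as mpicc.]

#include <stdio.h>
#include <stdint.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* File access property list requesting MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("collective_check.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One row of 16 ints per MPI process (invented layout). */
    hsize_t dims[2] = {(hsize_t)nprocs, 16};
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, fspace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own row of the file dataspace. */
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 16};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    /* Ask for collective transfer on the dataset transfer property list. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    int buf[16];
    for (int i = 0; i < 16; i++)
        buf[i] = rank;
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, buf);

    /* Query why HDF5 did or did not use collective I/O for the write above;
     * a value of 0 (H5D_MPIO_COLLECTIVE) means collective I/O was performed. */
    uint32_t local_cause = 0, global_cause = 0;
    H5Pget_mpio_no_collective_cause(dxpl, &local_cause, &global_cause);
    if (rank == 0)
        printf("no-collective cause: local=0x%x global=0x%x\n",
               (unsigned)local_cause, (unsigned)global_cause);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}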
> > Since I'm trying to track down various netcdf mpi issues I'm seeing on
> > Fedora, here is another.
> >
> > I started seeing this on Dec 3 trying to rebuild netcdf 4.2.1.1 for hdf5
> > 1.8.10 with:
> >   mpich2 1.5
> >   gcc 4.7.2-8
> >
> > A previous build on Nov 1 succeeded with:
> >   hdf5 1.8.9
> >   gcc 4.7.2-6
> >   mpich2 1.5
> >
> > So I suspect a change in hdf5 between 1.8.9 and 1.8.10.
> >
> > I'm currently testing with netcdf 4.3.0-rc1, gcc 4.8.0-0.14, hdf5 1.8.10,
> > mpich2 1.5.
> >
> > The test hangs here:
> >
> > Testing very simple parallel I/O with 4 processors...
> >
> > *** tst_parallel testing very basic parallel access.
> > *** tst_parallel testing whether we can create file for parallel access and write to it...
> >
> > The backtraces of three of the four processes, when attached with gdb, look like:
> >
> > (gdb) bt
> > #0  0x0000003819ab86b1 in MPID_nem_tcp_connpoll () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #1  0x0000003819aa5fd5 in MPIDI_CH3I_Progress () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #2  0x0000003819a601ad in MPIC_Wait () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #3  0x0000003819a60852 in MPIC_Sendrecv () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #4  0x0000003819a60cb4 in MPIC_Sendrecv_ft () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #5  0x0000003819adb172 in MPIR_Barrier_intra () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #6  0x0000003819adb26d in MPIR_Barrier_or_coll_fn () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #7  0x0000003819adb711 in MPIR_Barrier_impl () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #8  0x0000003819adba58 in PMPI_Barrier () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #9  0x0000003818a642a9 in H5AC_rsp__dist_md_write__flush (f=0x20d2780, dxpl_id=167772175, cache_ptr=0x7f3d0bc2b010) at ../../src/H5AC.c:4424
> > #10 0x0000003818a650c2 in H5AC_run_sync_point (f=0x20d2780, dxpl_id=167772175, sync_point_op=1) at ../../src/H5AC.c:4870
> > #11 0x0000003818a65532 in H5AC_flush_entries (f=0x20d2780) at ../../src/H5AC.c:5050
> > #12 0x0000003818a5c7d5 in H5AC_flush (f=0x20d2780, dxpl_id=167772174) at ../../src/H5AC.c:838
> > #13 0x0000003818ae490d in H5F_flush (f=0x20d2780, dxpl_id=167772174, closing=0) at ../../src/H5F.c:1758
> > #14 0x0000003818af0fba in H5F_flush_mounts_recurse (f=0x20d2780, dxpl_id=167772174) at ../../src/H5Fmount.c:659
> > #15 0x0000003818af1175 in H5F_flush_mounts (f=0x20d2780, dxpl_id=167772174) at ../../src/H5Fmount.c:698
> > #16 0x0000003818ae4648 in H5Fflush (object_id=16777216, scope=H5F_SCOPE_GLOBAL) at ../../src/H5F.c:1704
> > #17 0x00007f3d0e24199c in sync_netcdf4_file (h5=0x20d1270) at ../../libsrc4/nc4file.c:2964
> > #18 0x00007f3d0e242862 in NC4_enddef (ncid=<optimized out>) at ../../libsrc4/nc4file.c:2922
> > #19 0x00007f3d0e1f44d2 in nc_enddef (ncid=65536) at ../../libdispatch/dfile.c:786
> > #20 0x0000000000400f59 in main (argc=1, argv=0x7fffece50d88) at ../../nc_test4/tst_parallel.c:111
> >
> > The other looks like:
> >
> > (gdb) bt
> > #0  0x0000003819aa6005 in MPIDI_CH3I_Progress () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #1  0x0000003819a601ad in MPIC_Wait () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #2  0x0000003819a60436 in MPIC_Recv () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #3  0x0000003819a60af9 in MPIC_Recv_ft () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #4  0x0000003819addab2 in MPIR_Bcast_binomial.isra.1 () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #5  0x0000003819addef3 in MPIR_Bcast_intra () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #6  0x0000003819adeb7d in MPIR_Bcast_impl () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #7  0x0000003819ad90c7 in MPIR_Allreduce_intra () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #8  0x0000003819ada6f2 in MPIR_Allreduce_impl () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #9  0x0000003819adacde in PMPI_Allreduce () from /usr/lib64/mpich2/lib/libmpich.so.8
> > #10 0x0000003818ac1f7b in H5D__mpio_opt_possible (io_info=0x7ffffa47e2a0, file_space=0x84e150, mem_space=0x867950, type_info=0x7ffffa47e220, fm=0x7ffffa47e380, dx_plist=0x854540) at ../../src/H5Dmpio.c:241
> > #11 0x0000003818ac0050 in H5D__ioinfo_adjust (io_info=0x7ffffa47e2a0, dset=0x854900, dxpl_id=167772189, file_space=0x84e150, mem_space=0x867950, type_info=0x7ffffa47e220, fm=0x7ffffa47e380) at ../../src/H5Dio.c:999
> > #12 0x0000003818abf1bc in H5D__write (dataset=0x854900, mem_type_id=50331660, mem_space=0x867950, file_space=0x84e150, dxpl_id=167772189, buf=0x7ffffa489170) at ../../src/H5Dio.c:667
> > #13 0x0000003818abd8e9 in H5Dwrite (dset_id=83886083, mem_type_id=50331660, mem_space_id=67108867, file_space_id=67108866, dxpl_id=167772189, buf=0x7ffffa489170) at ../../src/H5Dio.c:265
> > #14 0x00007f407ab6992a in nc4_put_vara (nc=<optimized out>, ncid=ncid@entry=65536, varid=varid@entry=0, startp=startp@entry=0x7ffffa489130, countp=countp@entry=0x7ffffa489150, mem_nc_type=mem_nc_type@entry=4, is_long=is_long@entry=0, data=data@entry=0x7ffffa489170) at ../../libsrc4/nc4hdf.c:795
> > #15 0x00007f407ab6418b in nc4_put_vara_tc (mem_type_is_long=0, op=0x7ffffa489170, countp=0x7ffffa489150, startp=0x7ffffa489130, mem_type=4, varid=0, ncid=65536) at ../../libsrc4/nc4var.c:1350
> > #16 NC4_put_vara (ncid=65536, varid=0, startp=0x7ffffa489130, countp=0x7ffffa489150, op=0x7ffffa489170, memtype=4) at ../../libsrc4/nc4var.c:1484
> > #17 0x00007f407ab17075 in NC_put_vara (ncid=ncid@entry=65536, varid=varid@entry=0, start=start@entry=0x7ffffa489130, edges=edges@entry=0x7ffffa489150, value=value@entry=0x7ffffa489170, memtype=memtype@entry=4) at ../../libdispatch/dvarput.c:79
> > #18 0x00007f407ab17f0f in nc_put_vara_int (ncid=65536, varid=0, startp=startp@entry=0x7ffffa489130, countp=countp@entry=0x7ffffa489150, op=op@entry=0x7ffffa489170) at ../../libdispatch/dvarput.c:628
> > #19 0x0000000000401010 in main (argc=1, argv=0x7ffffa489648) at ../../nc_test4/tst_parallel.c:138
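[Archive note, not part of the original exchange: the backtraces above boil down to the call
pattern sketched below.  This is a stripped-down, hypothetical stand-in for
nc_test4/tst_parallel.c, not its actual source; nc_enddef() is where three of the four
processes block in MPI_Barrier(), and nc_put_vara_int() is where the fourth blocks in
MPI_Allreduce().  File, dimension, and variable names are invented.]

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

#define NX 16

int main(int argc, char **argv)
{
    int mpi_rank, mpi_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* Create a netCDF-4/HDF5 file for parallel access over MPI-IO. */
    int ncid, dimids[2], varid;
    if (nc_create_par("tst_parallel_sketch.nc", NC_NETCDF4 | NC_MPIIO,
                      MPI_COMM_WORLD, MPI_INFO_NULL, &ncid)) return 1;

    /* One row per MPI task, NX ints per row (invented layout). */
    if (nc_def_dim(ncid, "task", mpi_size, &dimids[0])) return 1;
    if (nc_def_dim(ncid, "x", NX, &dimids[1])) return 1;
    if (nc_def_var(ncid, "data", NC_INT, 2, dimids, &varid)) return 1;

    /* First reported hang: nc_enddef() flushes the file, and the HDF5
     * metadata-cache sync point ends up waiting in MPI_Barrier(). */
    if (nc_enddef(ncid)) return 1;

    /* Each task writes its own row (independent access in this sketch). */
    if (nc_var_par_access(ncid, varid, NC_INDEPENDENT)) return 1;
    size_t start[2] = {(size_t)mpi_rank, 0}, count[2] = {1, NX};
    int data[NX];
    for (int i = 0; i < NX; i++)
        data[i] = mpi_rank;

    /* Second reported hang: nc_put_vara_int() reaches H5Dwrite(), which calls
     * MPI_Allreduce() while deciding whether collective I/O is possible. */
    if (nc_put_vara_int(ncid, varid, start, count, data)) return 1;

    if (nc_close(ncid)) return 1;
    MPI_Finalize();
    return 0;
}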
> >
> > With openmpi 1.6.3, it appears to hang at the previous test:
> >
> > Testing very simple parallel I/O with 4 processors...
> >
> > *** tst_parallel testing very basic parallel access.
> >
> > Similar backtraces:
> >
> > Three with:
> >
> > (gdb) bt
> > #0  0x00000037a6cda4c7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
> > #1  0x000000381a317a5d in opal_progress () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #2  0x000000381a261acd in ompi_request_default_wait_all () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #3  0x00007f44f1d3a6e7 in ompi_coll_tuned_sendrecv_actual () from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #4  0x00007f44f1d423ae in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #5  0x000000381a26fc86 in PMPI_Barrier () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #6  0x00007f44f7cc85c4 in H5AC_rsp__dist_md_write__flush (f=0x1924a30, dxpl_id=167772175, cache_ptr=0x19256d0) at ../../src/H5AC.c:4424
> > #7  0x00007f44f7cc93e1 in H5AC_run_sync_point (f=0x1924a30, dxpl_id=167772175, sync_point_op=1) at ../../src/H5AC.c:4870
> > #8  0x00007f44f7cc9851 in H5AC_flush_entries (f=0x1924a30) at ../../src/H5AC.c:5050
> > #9  0x00007f44f7cc0ad0 in H5AC_flush (f=0x1924a30, dxpl_id=167772174) at ../../src/H5AC.c:838
> > #10 0x00007f44f7d48d15 in H5F_flush (f=0x1924a30, dxpl_id=167772174, closing=0) at ../../src/H5F.c:1758
> > #11 0x00007f44f7d553c2 in H5F_flush_mounts_recurse (f=0x1924a30, dxpl_id=167772174) at ../../src/H5Fmount.c:659
> > #12 0x00007f44f7d5557d in H5F_flush_mounts (f=0x1924a30, dxpl_id=167772174) at ../../src/H5Fmount.c:698
> > #13 0x00007f44f7d48a50 in H5Fflush (object_id=16777216, scope=H5F_SCOPE_GLOBAL) at ../../src/H5F.c:1704
> > #14 0x00007f44f8537adc in sync_netcdf4_file (h5=0x191cc50) at ../../libsrc4/nc4file.c:2964
> > #15 0x00007f44f85389a2 in NC4_enddef (ncid=<optimized out>) at ../../libsrc4/nc4file.c:2922
> > #16 0x00007f44f84ea612 in nc_enddef (ncid=65536) at ../../libdispatch/dfile.c:786
> > #17 0x0000000000400f88 in main (argc=1, argv=0x7fff476e1958) at ../../nc_test4/tst_parallel.c:111
> >
> > One with:
> >
> > (gdb) bt
> > #0  0x00000037a6cda4c7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
> > #1  0x000000381a317a5d in opal_progress () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #2  0x000000381a261acd in ompi_request_default_wait_all () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #3  0x00007f26db7d4c99 in ompi_coll_tuned_allreduce_intra_recursivedoubling () from /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so
> > #4  0x000000381a26e66b in PMPI_Allreduce () from /usr/lib64/openmpi/lib/libmpi.so.1
> > #5  0x00007f26e17be321 in H5D__mpio_opt_possible (io_info=0x7fffbcbb0110, file_space=0x210d3b0, mem_space=0x21afc20, type_info=0x7fffbcbb0090, fm=0x7fffbcbb0200, dx_plist=0x219c7e0) at ../../src/H5Dmpio.c:241
> > #6  0x00007f26e17bc3ee in H5D__ioinfo_adjust (io_info=0x7fffbcbb0110, dset=0x219cba0, dxpl_id=167772189, file_space=0x210d3b0, mem_space=0x21afc20, type_info=0x7fffbcbb0090, fm=0x7fffbcbb0200) at ../../src/H5Dio.c:999
> > #7  0x00007f26e17bb550 in H5D__write (dataset=0x219cba0, mem_type_id=50331660, mem_space=0x21afc20, file_space=0x210d3b0, dxpl_id=167772189, buf=0x7fffbcbbaff0) at ../../src/H5Dio.c:667
> > #8  0x00007f26e17b9c7d in H5Dwrite (dset_id=83886083, mem_type_id=50331660, mem_space_id=67108867, file_space_id=67108866, dxpl_id=167772189, buf=0x7fffbcbbaff0) at ../../src/H5Dio.c:265
> > #9  0x00007f26e1fd9a6a in nc4_put_vara (nc=<optimized out>, ncid=ncid@entry=65536, varid=varid@entry=0, startp=startp@entry=0x7fffbcbbafb0, countp=countp@entry=0x7fffbcbbafd0, mem_nc_type=mem_nc_type@entry=4, is_long=is_long@entry=0, data=data@entry=0x7fffbcbbaff0) at ../../libsrc4/nc4hdf.c:795
> > #10 0x00007f26e1fd42cb in nc4_put_vara_tc (mem_type_is_long=0, op=0x7fffbcbbaff0, countp=0x7fffbcbbafd0, startp=0x7fffbcbbafb0, mem_type=4, varid=0, ncid=65536) at ../../libsrc4/nc4var.c:1350
> > #11 NC4_put_vara (ncid=65536, varid=0, startp=0x7fffbcbbafb0, countp=0x7fffbcbbafd0, op=0x7fffbcbbaff0, memtype=4) at ../../libsrc4/nc4var.c:1484
> > #12 0x00007f26e1f871b5 in NC_put_vara (ncid=ncid@entry=65536, varid=varid@entry=0, start=start@entry=0x7fffbcbbafb0, edges=edges@entry=0x7fffbcbbafd0, value=value@entry=0x7fffbcbbaff0, memtype=memtype@entry=4) at ../../libdispatch/dvarput.c:79
> > #13 0x00007f26e1f8804f in nc_put_vara_int (ncid=65536, varid=0, startp=startp@entry=0x7fffbcbbafb0, countp=countp@entry=0x7fffbcbbafd0, op=op@entry=0x7fffbcbbaff0) at ../../libdispatch/dvarput.c:628
> > #14 0x000000000040103b in main (argc=1, argv=0x7fffbcbbb528) at ../../nc_test4/tst_parallel.c:138
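[Archive note, not part of the original exchange: since the report hinges on which HDF5
release the netCDF library is built against (1.8.9 versus 1.8.10), a small check like the
following can confirm the library versions actually linked at run time.  It is only a sketch
and assumes the netCDF and HDF5 development headers and libraries are installed.]

#include <stdio.h>
#include <netcdf.h>
#include <hdf5.h>

int main(void)
{
    unsigned maj, min, rel;

    /* Version string of the netCDF library that was linked in. */
    printf("netCDF library version: %s\n", nc_inq_libvers());

    /* Runtime version of the underlying HDF5 library. */
    H5get_libversion(&maj, &min, &rel);
    printf("HDF5 library version:   %u.%u.%u\n", maj, min, rel);

    return 0;
}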
> >
> > --
> > Orion Poplawski
> > Technical Manager                     303-415-9701 x222
> > NWRA, Boulder Office                  FAX: 303-415-9702
> > 3380 Mitchell Lane                    address@hidden
> > Boulder, CO 80301                     http://www.nwra.com
> >
>
> Russ Rew                                         UCAR Unidata Program
> address@hidden                                   http://www.unidata.ucar.edu
>

Russ Rew                                         UCAR Unidata Program
address@hidden                                   http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: PMD-881681
Department: Support netCDF
Priority: Normal
Status: Closed