[netCDF #PDL-125161]: Writing parallel files with zero-size chunks
- Subject: [netCDF #PDL-125161]: Writing parallel files with zero-size chunks
- Date: Tue, 28 Aug 2012 16:18:52 -0600
Hi Thomas,
> I have an MPI-parallel application with a decomposition such that each array
> is completely handled by one process for I/O purposes and the arrays are
> distributed in a round-robin fashion, i.e. task 0 holds all of array A, task 1
> holds all of array B and so forth.
>
> My expectation was that I could write this with netcdf4 parallel I/O, so I
> compiled netcdf 4.2.1.1 for OpenMPI 1.4.2 and hdf5 1.8.9 on Debian GNU/Linux
> x86_64 and started testing.
>
> Unfortunately, when I issue nc_create_par with NC_MPIPOSIX and
> nc_var_par_access with flag NC_INDEPENDENT I only get invalid output, when I
> change the nc_create_par option to NC_MPIIO the program hangs on nc_close.
>
> I've reduced my use-case to a small test mostly resembling one of the
> demonstration programs. I think the most relevant part is that the processes
> not having any elements from the array each use start and count values of 0
> for every dimension.
>
> Please see the attached files for more information.
>
> When running the attached program with
>
> $ mpirun -n 5 ./nc4partest
> mpi_name: taifun size: 5 rank: 0, isDataWriter=0
> mpi_name: taifun size: 5 rank: 1, isDataWriter=0
> mpi_name: taifun size: 5 rank: 2, isDataWriter=1
> mpi_name: taifun size: 5 rank: 4, isDataWriter=0
> mpi_name: taifun size: 5 rank: 3, isDataWriter=0
> mpi_rank=1 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=2 start[0]=0 start[1]=0 count[0]=24 count[1]=24
> mpi_rank=0 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=3 start[0]=0 start[1]=0 count[0]=0 count[1]=0
> mpi_rank=4 start[0]=0 start[1]=0 count[0]=0 count[1]=0
>
> and from this point on the program hangs.
>
> I've tried to locate a hint how to use the nc_put_vara_int call for this case
> but found nothing.
>
> Do I have to redistribute the data before writing? Are there other values for
> start/count I could use?
I just succeeded in running a test case that used count[0] = 0 on an MPI
parallel file system using the netCDF-4 parallel I/O inherited from HDF5, and
it ran fine.
The test I ran just inserted the following code in a loop after line 136 in
nc_test4/tst_parallel.c:

    /* See if count dimension == 0 returns error */
    count_save = count[0];
    count[0] = 0;
    if (nc_put_vara_int(ncid, v1id, start, count, slab_data)) ERR;
    count[0] = count_save;

Discussion with CISL consultants suggests the problem may be
platform-specific.
--Russ
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: PDL-125161
Department: Support netCDF
Priority: Normal
Status: Closed