Re: 20030618: NetCDF performance problems
- Subject: Re: 20030618: NetCDF performance problems
- Date: Mon, 30 Jun 2003 10:30:31 -0600
>To: address@hidden,
>From: Gottfried Necker <gottfried.necker@xxxxxxxxxxx>
>Subject: NetCDF performance problems.
>Organization: .
>Keywords: 200306180911.h5I9BvLd025090 netCDF 3.5.1-beta10 Fujitsu VPP
Gottfried,
I tried profiling netcdf-beta5 and netcdf-beta10 this morning on the
nc_test test program provided with the distribution, on a Solaris 8
platform, and you were right: a platform-independent performance
problem has been introduced that we need to fix before release.
Here's a comparison of the number of times px_pgin and px_pgout are
called using the netcdf-3.5.1 beta5 versus beta10 releases, using
gprof to profile and just capturing the number of calls with grep:
test/gf/beta5-xpg/src/nc_test$ gprof nc_test | grep px_pg
0.00 0.07 2788/2788 px_pgin [14]
0.00 0.00 39/2129 px_pgout [9]
...
test/gf/beta10-xpg/src/nc_test$ gprof nc_test | grep px_pg
0.02 1.43 112164/112164 px_pgin [5]
0.00 0.00 43/30624 px_pgout [9]
...
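The call counts above can be compared mechanically. The following is only a sketch: it assumes the number after the slash on each gprof call-graph line is the total call count for the named routine (which is how the figures are read in this message), and the helper name `total_calls` is made up for illustration.

```python
import re

# Matches the "called/total  name [index]" tail of a gprof call-graph line.
CALLS = re.compile(r"(\d+)/(\d+)\s+(\w+)\s+\[\d+\]")

def total_calls(gprof_lines):
    """Map routine name -> total call count (the number after the slash)."""
    totals = {}
    for line in gprof_lines:
        match = CALLS.search(line)
        if match:
            totals[match.group(3)] = int(match.group(2))
    return totals

# Lines copied from the grep output quoted in this message.
beta5 = total_calls([
    "0.00 0.07 2788/2788 px_pgin [14]",
    "0.00 0.00 39/2129 px_pgout [9]",
])
beta10 = total_calls([
    "0.02 1.43 112164/112164 px_pgin [5]",
    "0.00 0.00 43/30624 px_pgout [9]",
])

# Growth factor in call counts from beta5 to beta10, per routine.
ratios = {name: beta10[name] / beta5[name] for name in beta5}

for name, ratio in sorted(ratios.items()):
    print(f"{name}: {beta5[name]} -> {beta10[name]} calls ({ratio:.1f}x)")
```

Under that reading, px_pgin is called roughly 40 times as often in beta10 as in beta5, and px_pgout roughly 14 times as often.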
Since you've given us enough information to reproduce the problem here
on a Solaris platform, we should be able to fix it here and put the
fix into the next release.
I'll let you know when we have figured out what the problem is and have a
patch to test. Thanks!
--Russ
On Fri, 20 Jun 2003 11:57:56 +020, you wrote:
> Hi Gottfried,
>
> > >
> > > Another possibility would be providing you with some versions between
> > > beta3 and beta10 that would help isolate which changes caused the
> > > problem.
> > I tried with netcdf-3.5.1-beta5 and there's no problem. I went back to
> > beta10 and got the problem again. I diffed the libsrc directory and
> > the only substantial difference between these versions is in
> > posixio.c, where the call to ftruncate is replaced by calls to seek. I
> > will try to put the code with ftruncate into beta10 to see what
> > happens. But I don't have the time to do it now. I will try this on
> > Friday.
>
> Thanks, just this information is a big help. I'm also anxious to hear
> what you find out when substituting ftruncate for the call to lseek.
> The revision notice we have on that change was:
>
> ... eliminated unnecessary use of ftruncate(), because it fails on
> FAT32 file systems under Linux.
>
> If this causes a performance problem on other systems, maybe we can
> find a better fix for the Linux problem.
>
> --Russ
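For readers unfamiliar with the two approaches discussed in the quoted exchange, here is a minimal sketch of extending a file to a given length either with ftruncate or with a seek followed by a one-byte write. It is not the code from posixio.c, whose exact logic is not shown in this thread; the function names are hypothetical and the sketch only illustrates the general technique on a POSIX-like system.

```python
import os
import tempfile

def extend_with_ftruncate(fd, length):
    """Set the file length directly with ftruncate()."""
    os.ftruncate(fd, length)

def extend_with_seek(fd, length):
    """Grow the file by seeking to the last byte and writing it.

    Assumption for illustration: only grows the file, never shrinks it.
    """
    if os.fstat(fd).st_size < length:
        os.lseek(fd, length - 1, os.SEEK_SET)
        os.write(fd, b"\0")

def demo(length=4096):
    """Apply both strategies to fresh temporary files and report the sizes."""
    sizes = {}
    strategies = (("ftruncate", extend_with_ftruncate),
                  ("seek", extend_with_seek))
    for name, extend in strategies:
        fd, path = tempfile.mkstemp()
        try:
            extend(fd, length)
            sizes[name] = os.fstat(fd).st_size
        finally:
            os.close(fd)
            os.remove(path)
    return sizes

print(demo())
```

Both strategies leave the file at the requested length; they differ in which system calls are issued, which is why a file system that rejects one may still accept the other.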
At first I thought there was a single problem, but now I think there
are two. Beta10 uses too much system time on NFS and waits for I/O
on the local file systems. If I put posixio.c (rev. 1.69) into the
beta10 source and recompile the library, the waiting-for-I/O problem
is gone, but then I also see the system time problem on the local
file system. I did PC sampling on my program with beta10, compared
it with beta5, and found that some routines (px_pgin and px_pgout)
are called many times more often with beta10 than with beta5.
If these routines really are called that often, it would explain the
huge difference in system time usage. But I have no idea why this
happens.
To illustrate the problem, here's the output of timex for beta10 (with
posixio 1.69):
real 11:19.21
user 7:41.07
sys 2:17.28
vu-user 6:03.32
vu-sys 0.00
For comparison the output for beta5:
real 10:39.11
user 9:46.32
sys 4.26
vu-user 7:56.27
vu-sys 0.00
Actually the problem is slightly worse than shown here, because the
beta10 calculation is stopped earlier.
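To put the two timex reports side by side, the sketch below converts the `[min:]sec` values to seconds and compares the system time. The values are copied from the output above; the helper name `to_seconds` and the dictionary layout are illustrative assumptions, not part of timex.

```python
def to_seconds(stamp):
    """Convert a timex value such as '2:17.28' or '4.26' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# real/user/sys values copied from the two timex reports above.
beta10 = {"real": "11:19.21", "user": "7:41.07", "sys": "2:17.28"}
beta5 = {"real": "10:39.11", "user": "9:46.32", "sys": "4.26"}

beta10_s = {key: to_seconds(value) for key, value in beta10.items()}
beta5_s = {key: to_seconds(value) for key, value in beta5.items()}

# How much more system time beta10 uses than beta5.
sys_ratio = beta10_s["sys"] / beta5_s["sys"]

# Share of wall-clock time spent in system calls for each build.
beta10_sys_share = beta10_s["sys"] / beta10_s["real"]
beta5_sys_share = beta5_s["sys"] / beta5_s["real"]

print(f"sys time ratio beta10/beta5: {sys_ratio:.1f}")
print(f"sys share of real time: beta10 {beta10_sys_share:.1%}, "
      f"beta5 {beta5_sys_share:.1%}")
```

On these figures beta10 spends about 32 times as much system time as beta5, roughly a fifth of its wall-clock time compared with under one percent for beta5.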
I don't know what could cause such a problem, but I suspect it is
also present on other platforms. But maybe on those platforms you
don't pay such a high price for calling px_pg* too often.
Gottfried