Re: 20000515: pbuf_flush log messages (fwd)
- Subject: Re: 20000515: pbuf_flush log messages (fwd)
- Date: Tue, 16 May 2000 09:19:09 -0600
>
> Date: Mon, 15 May 2000 08:56:21 -0600
> From: Unidata Support <address@hidden>
> Reply-To: Chris Herbster <address@hidden>
> To: address@hidden
> Subject: 20000515: pbuf_flush log messages
>
> >To: address@hidden,
> >To: "Herbster, Chris" <address@hidden>
> >From: Chris Herbster <address@hidden>
> >Subject: pbuf_flush log messages
> >Organization: UCAR/Unidata
> >Keywords: 200005121726.e4CHQ1413099
>
> I have almost re-established our system after a hard disk failure that
> resulted in the complete loss of our LDM account (no tape device yet,
> and the tar file that I thought was on another machine evaporated -
> bummer).
>
> We used a 6 GB drive that we had in another system that has a video card
> problem. The server (opwx.db.erau.edu) is running Redhat 5.2.
>
> [herbster@opwx ~]$ uname -a
> Linux opwx.db.erau.edu 2.0.36 #8 Tue Aug 31 13:14:50 EDT 1999 i686
> unknown
>
> I am getting repeated pbuf_flush messages in the log:
>
> [ldm@opwx ~]$ head -30 logs/ldmd.log
> May 12 16:46:45 opwx rpc.ldmd[1004]: Starting Up (built: May 5 2000
> 16:28:13)
> May 12 16:46:45 opwx pluto[1008]: run_requester: Starting Up:
> pluto.met.fsu.edu
> May 12 16:46:45 opwx pluto[1008]: run_requester: 20000512154646.006
> TS_ENDT {{DDPLUS, ".*"}}
> May 12 16:46:46 opwx striker[1009]: run_requester: Starting Up:
> striker.atmos.albany.edu
> May 12 16:46:46 opwx striker[1009]: run_requester: 20000512154646.026
> TS_ENDT {{NLDN, ".*"}}
> May 12 16:46:46 opwx pqexpire[1005]: Starting Up
> May 12 16:46:46 opwx pqbinstats[1006]: Starting Up (1004)
> May 12 16:46:46 opwx pqact[1007]: Starting Up
> May 12 16:46:46 opwx striker[1009]: FEEDME(striker.atmos.albany.edu): OK
> May 12 16:46:46 opwx pluto[1008]: FEEDME(pluto.met.fsu.edu): OK
> May 12 16:46:48 opwx localhost[1015]: Connection from localhost
> May 12 16:46:48 opwx localhost[1015]: Connection reset by peer
> May 12 16:46:48 opwx localhost[1015]: Exiting
> May 12 16:47:15 opwx pqact[1007]: pbuf_flush 4: time elapsed 2.535485
> May 12 16:47:37 opwx pqact[1007]: pbuf_flush 4: time elapsed 9.433849
> May 12 16:47:49 opwx pqact[1007]: pbuf_flush 4: time elapsed 11.849885
> May 12 16:47:53 opwx pqact[1007]: pbuf_flush 4: time elapsed 3.429464
> May 12 16:47:58 opwx pqact[1007]: pbuf_flush 4: time elapsed 4.590907
> May 12 16:48:14 opwx pqact[1007]: pbuf_flush 4: time elapsed 15.620028
> May 12 16:48:22 opwx pqact[1007]: pbuf_flush 4: time elapsed 8.316928
> May 12 16:48:50 opwx pqact[1007]: pbuf_flush 4: time elapsed 26.827392
> May 12 16:49:04 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.800704
> May 12 16:49:18 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.448355
> May 12 16:49:34 opwx pqact[1007]: pbuf_flush 4: time elapsed 15.369003
> May 12 16:49:36 opwx pqact[1007]: pbuf_flush 4: time elapsed 2.562947
> May 12 16:49:49 opwx pqact[1007]: pbuf_flush 4: time elapsed 12.757056
> May 12 16:50:02 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.213933
> May 12 16:50:16 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.347354
> May 12 16:50:29 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.105565
> May 12 16:51:00 opwx pqact[1007]: pbuf_flush 4: time elapsed 31.202759
> ...
> May 12 17:00:02 opwx pqact[1007]: pbuf_flush 4: time elapsed 17.271805
> May 12 17:00:11 opwx pqact[1007]: pbuf_flush 4: time elapsed 6.612698
> May 12 17:00:31 opwx pqact[1007]: pbuf_flush 4: time elapsed 20.289586
> May 12 17:00:44 opwx pqact[1007]: pbuf_flush 4: time elapsed 13.348882
> May 12 17:01:04 opwx pqact[1007]: pbuf_flush 4: time elapsed 18.993940
> May 12 17:01:22 opwx pqact[1007]: pbuf_flush 4: time elapsed 17.884479
> May 12 17:01:37 opwx pqact[1007]: pbuf_flush 4: time elapsed 15.760065
> May 12 17:01:56 opwx pqexpire[1005]: > Recycled 3467.727 kb/hr (
> 5661.799 prods per hour)
> May 12 17:02:02 opwx pqact[1007]: pbuf_flush 4: time elapsed 24.224632
>
> Here is our request from ldmd.conf:
>
> # LDM5 servers request data
> #
> # request <feedset> <pattern> <hostname pattern>
> #
> request DDPLUS ".*" pluto.met.fsu.edu
> #request HDS ".*" pluto.met.fsu.edu
> #request IDS ".*" pluto.met.fsu.edu
> #request MCIDAS ".*" pluto.met.fsu.edu
>
> request NLDN ".*" striker.atmos.albany.edu
>
> This data volume doesn't seem to be too ambitious for testing the
> system, is it?
>
> I've looked in the ldm-support archive and think that this is related to
> having a "slow hard disk" - is this true? (It is an ATA/IDE disk.)
>
> hdc: WDC AC26400R, 6149MB w/512kB Cache, CHS=13328/15/63
>
> Please let me know what, if anything, I can do to fix this.
>
> Thanks!!
>
> Chris Herbster
Good Morning, Chris!
No, your data volume is not too ambitious. Here's what's going on: pqact is
trying to send data down a pipe, presumably to a decoder. The pipe's buffer
filled up because whatever is at the other end wasn't able to consume the data
at the time, so the write had to block and took more than one second to
complete. These messages are informational; they may not be a problem if they
occur infrequently and you're not losing data. But the frequency and timing of
your messages seem to indicate a significant problem.
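(If you'd like to see the same behavior outside of the LDM, here is a minimal
sketch using only standard tools; nothing LDM-specific is assumed, and under
bash 'time' covers the whole pipeline:

  $ time dd if=/dev/zero bs=64k count=16 2>/dev/null | ( sleep 10; cat > /dev/null )

dd pushes about 1 MB into the pipe, but the reader sleeps for 10 seconds before
draining it, so dd's writes block on the full pipe buffer and the elapsed time
comes out at roughly those 10 seconds, which is the same kind of stall that
pbuf_flush is reporting.)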
The question is, why was the resource at the other end unavailable? The problem
could indeed be a slow disk, as you suggested, although it would have to be
pretty darn slow to struggle with your relatively small amount of data. Or
maybe the decoder timed out or exited, although there would probably be other
messages in the log if that were the case. Either way, I wouldn't expect the
messages to occur with such frequency if the problem were one of those causes.
Maybe some other, unrelated process is tying up the disk. Or maybe the write
permissions aren't set properly somewhere in your directory structure.
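(If you want to put a number on the disk itself, one quick check, assuming
hdparm is installed and using the /dev/hdc device name from your boot message,
is to measure raw read throughput while the LDM is quiet; it usually needs to
be run as root:

  # hdparm -t /dev/hdc     <- timed device reads
  # hdparm -T /dev/hdc     <- timed cache reads, for comparison

Very low device-read numbers, say only a few MB/sec, would point at the drive
or its transfer mode rather than at the LDM.)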
Another possibility is a "SCSI termination problem", but that sounds pretty
speculative to me. Still, if you exhaust other possibilities, take a look at
http://www.unidata.ucar.edu/cgi-bin/mfs/65/2435?39#mfs for more on this topic.
So, check your permissions, ensure that any required decoders are indeed
running (although I would think you'd see a different message if they weren't),
use 'top' to see what processes are running that might be hogging the disk, and
ensure that there's plenty of space on the disk. If disk speed alone really is
your problem, possible solutions are: get a faster disk, decrease the number of
processes using the disk, or offload some processing to another machine.
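(As a concrete starting point, something like the following covers those
checks; the ~ldm/data path is just an example, so substitute wherever your
pqact actions actually write:

  $ ls -ld ~ldm/data          <- write permission for the 'ldm' user?
  $ df -k ~ldm                <- plenty of free space on that filesystem?
  $ ps aux | grep pqact       <- is pqact still there? (grep for your decoder names, too)
  $ top                       <- is some other process hogging the CPU or disk?

None of these are LDM-specific, so they're safe to run while the LDM is up.)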
Good luck! When you discover the cause of your problem, please let us know;
that is always helpful to us and to others in our community.
Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************