
20040526: bigbird status (cont.)

This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.

Subject: 20040526: bigbird status (cont.)
Date: Wed, 26 May 2004 18:02:45 -0600

>From:  Gerry Creager N5JXS <address@hidden>
>Organization:  Texas A&M University -- AATLT
>Keywords:  200405262241.i4QMfhtK005798 LDM RAID JFS

Hi Gerry,

>IM==Instant Messaging... Sometimes convenient for on-line communications 
>while remote troubleshooting...

I should have known...

>I've seen the RAID failure on reboot several times.

Interesting...  Did you run fsck (or variant) to get things patched
up before remounting the RAID filesystem?

>I really want to 
>get rid of this card and get into a 3Ware card.  New 'Net find today 
>suggests that, as suspected, Promise's proprietary RAID is less than 
>advertised.  They said nicer things 'bout HighPoint, but the only "real 
>RAID" comments were reserved for Adaptec and 3Ware... noting Adaptec 
>followed 3Ware's lead.

I believe that Pete Pokrandt of U Wisc/AOS is using a 3Ware card in
his Linux PC.

>effectively, when rebooting, the system times out while flushing the 
>product queue now.

Product queue?  If you mean LDM product queue, that is on a different
file system.  Also, I did not see a startup script for the LDM
in /etc/init.d, so I added one:

/etc/init.d/ldmd

Since this script was not in place at the time of the reboot, it could
not have checked the LDM queue.

>That apparently is tied to the RAID corruption in 
>some manner.  If we really saw a RAID corruption while running today, 
>that's a first for me on this system.  Further, there's spares.  It 
>should have alarmed and fixed itself.

I agree, but the load average did go up to 400...

>I'll keep looking.  By doing the reboot, we did salvage the messages 
>logs, and there might be some clues.

OK.

>Thanks for spotting the problem.  I was working on bigfoot and didn't 
>even look at bigbird today, save to place it on a KVM.  Hmmm.  It's 
>possible that caused a hiccup, but it shouldn't have.  It's been in idle 
>state WRT the monitor, keyboard, mouse for weeks.  The reboot for the 
>box is serendipitous.  I wasn't planning to reboot 'til needed anyway, 
>so I'd not have had console access (at least for X) 'til I did. 
>Keyboard and video worked as expected...

OK.

>Later, gerry

fsck.jfs is still running on /dev/md0, and it will
take time to finish.  I will try to look in on bigbird later tonight
or early tomorrow morning.  As soon as fsck.jfs finishes, I will try
to mount /data and crank up the LDM.

Time to head home...

Tom

From: Tom Yoksas <address@hidden>
To: Gerry Creager N5JXS <address@hidden>

>From:  Gerry Creager N5JXS <address@hidden>
>Organization:  Texas A&M University -- AATLT
>Keywords:  200405262241.i4QMfhtK005798 LDM RAID JFS

The RAID was not mounted on bootup.  Is there supposed to be an entry
in /etc/fstab to mount it?  I couldn't find the entry I thought would
be there.
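
For reference, a minimal sketch of how one might look for such an entry.
The function name, the sample fstab text, and the choice of /dev/md0 on
/data with a jfs type are illustrative assumptions, not the contents of
bigbird's actual /etc/fstab:

```python
def find_fstab_entry(fstab_text, device=None, mount_point=None):
    """Return the fields of the first fstab entry matching the device
    or the mount point, or None if no entry matches."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # not a usable entry
        if device is not None and fields[0] == device:
            return fields
        if mount_point is not None and fields[1] == mount_point:
            return fields
    return None


# Hypothetical fstab contents for illustration only.
SAMPLE_FSTAB = """\
# device   mount  type  options   dump pass
/dev/hda1  /      ext3  defaults  1 1
/dev/md0   /data  jfs   defaults  1 2
"""

entry = find_fstab_entry(SAMPLE_FSTAB, device="/dev/md0")
print(entry)
```

To check a real system, the text of /etc/fstab could be read and passed in
place of SAMPLE_FSTAB; a result of None would mean no entry was found.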

We tried mounting /dev/md0 as /data, but got a bad super block
message.  We are now running /sbin/fsck.jfs to check for problems.
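
Since /dev/md0 is a Linux md device, one additional place to look is
/proc/mdstat, where a missing member shows up as an underscore in the
status brackets (for example [UU_]).  The following is a rough sketch of
that check; the function name and the sample text are made up for
illustration and do not reflect bigbird's actual array layout:

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member status contains '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)  # start of a new array stanza
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # status for this array has been handled
    return degraded


# Hypothetical /proc/mdstat text for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 sdc1[2] sdb1[1] sda1[0]
      976640 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE_MDSTAT))
```

This only reports what the md driver believes about its members; it says
nothing about the state of the JFS filesystem on top, which is what
fsck.jfs is checking.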

The RAID having a failure and then not being available would fit
as a cause for the load average to ramp up to 400:  all processes
would be waiting to write to a resource that no longer existed.
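
On Linux, the load average counts tasks in uninterruptible sleep (state D,
typically waiting on I/O) as well as runnable ones, which is why blocked
writers can drive it so high.  A small sketch of counting process states
from /proc/<pid>/stat text follows; the function names and sample lines
are illustrative assumptions, not output captured from bigbird:

```python
import glob
from collections import Counter


def state_from_stat(stat_text):
    """Extract the state letter from the text of a /proc/<pid>/stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'."""
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(stat_texts):
    """Count processes per state letter (R, S, D, ...)."""
    return Counter(state_from_stat(text) for text in stat_texts)


def live_stat_texts():
    """Read the stat file of every process currently listed in /proc."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                texts.append(handle.read())
        except OSError:
            pass  # process exited between listing and reading
    return texts


# Hypothetical stat lines (truncated to the leading fields) for illustration.
SAMPLE_STAT_TEXTS = [
    "1234 (pqact) D 1 1234",
    "1235 (rpc.ldmd) S 1 1235",
    "1236 (my (odd) name) D 1 1236",
]

print(count_states(SAMPLE_STAT_TEXTS)["D"])
```

On a live Linux system, count_states(live_stat_texts())["D"] gives the
number of processes currently stuck in uninterruptible sleep, which can be
compared against the figures in /proc/loadavg.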

More later as we discover stuff...

Tom
--
+-----------------------------------------------------------------------------+
* Tom Yoksas                                             UCAR Unidata Program *
* (303) 497-8642 (last resort)                                  P.O. Box 3000 *
* address@hidden                                   Boulder, CO 80307 *
* Unidata WWW Service                             http://www.unidata.ucar.edu/*
+-----------------------------------------------------------------------------+
