[LDM #NNU-305318]: Catastrophic Failure of LDM System



Hi Martha,

re:
> We have had what amounts to a catastrophic failure on our system running
> LDM to process NOAAPORT data.  The system is a Dell PowerEdge 2850 running
> RedHat Linux, with the OS, including the LDM software, residing on a
> file system consisting of two disk  drives configured as a RAID-1 and
> the data being written to a file system on a RAID-5 consisting of four
> disk drives.

One quick comment: we do not recommend putting the LDM queue on a RAID
under Linux.  Our experience is that reads and writes are exceedingly
slow in this configuration.  In fact, we have been working with another
site over the past couple of weeks and just learned that the LDM queue
on their NOAAPort ingest system is on a RAID.  This setup has apparently
slowed the processing of products by 'readnoaaport' to the point where
the 4 'readnoaaport' processes cannot keep up with the data being
ingested from the NOAAPort SBN.  We recommend that sites put their LDM
queue on a non-RAID file system under Linux.  Interestingly, LDM queues
on RAIDs under Solaris work fine.
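
As a rough illustration of how one might check where the queue lives,
here is a minimal Python sketch.  It assumes a Linux system with
/proc/mounts, and it only recognizes Linux software RAID (md) devices by
name; a hardware RAID controller normally presents its volume as an
ordinary disk device, so this check cannot detect it.  The function
names and the queue path in the usage note are hypothetical and not part
of the LDM.

```python
import os


def find_mount(path, mounts_text):
    """Return (device, mount_point, fs_type) for the file system holding path.

    mounts_text is text in /proc/mounts format.  The entry with the
    longest matching mount point wins.  Returns None if nothing matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[0], fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point, fs_type)
    return best


def looks_like_md_raid(device):
    """Heuristic: Linux software RAID arrays are named /dev/md0, /dev/md/0, ..."""
    return device.startswith("/dev/md")


def check_queue(queue_path, mounts_file="/proc/mounts"):
    """Report which device holds queue_path and whether it looks like md RAID."""
    with open(mounts_file) as f:
        entry = find_mount(os.path.realpath(queue_path), f.read())
    if entry is None:
        return "no mount entry found for %s" % queue_path
    device, mount_point, _ = entry
    kind = "software RAID (md)" if looks_like_md_raid(device) else "not md RAID"
    return "%s is on %s mounted at %s: %s" % (queue_path, device, mount_point, kind)
```

For example, print(check_queue("/home/ldm/var/queues/ldm.pq")) would
report the device for that path; substitute the actual location of the
queue on your system.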

re:
> When we came in Monday, we noticed the error lights on all six of the
> disk drives and the system will no longer boot.

Error lights on all six drives at once suggest a failure of the
disk/RAID controller rather than of the individual drives.

re:
> We are working with DELL to identify what happened, but we thought
> we should ask if any other LDM users had ever experienced such a
> catastrophic failure using this hardware configuration, or any other
> hardware configuration for that matter.

Off the top of my head, I can't remember anyone reporting this kind
of problem.  Sites do experience disk failures periodically, but these
have always been related to bad hardware.

re:
> In all probability, we will have
> to rebuild the entire system and reinstall LDM as I think the system is
> beyond recovery.

Bummer!  A full rebuild may not be needed if the problem is a failure in
the RAID hardware...

re:
> Thanks for any advice you can offer.

We cannot imagine how the total failure of your disk subsystems could have
been caused by the LDM or by processing kicked off by the LDM ** unless **
some hardware was on the verge of failing and the read/write activity pushed
it over the edge.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: NNU-305318
Department: Support LDM
Priority: Normal
Status: Closed