
[LDM #IIX-893121]: Question...

This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.

Subject: [LDM #IIX-893121]: Question...
Date: Tue, 30 Oct 2007 15:53:52 -0600

Hi Gerry,

I tried to call you a couple of times earlier today; the first time your phone
was busy and the second time I got the canned message "all circuits are busy".

Anyway, we believe that we have found the problem on sasquatch: there is some
sort of memory problem that likely caused a reboot on October 15 at around
15:00.  Here is the relevant snippet from /var/log/messages for the 15th:

Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 14:25:13 sasquatch snmpd[30901]: Connection from - 10.2.9.37
Oct 15 14:25:13 sasquatch snmpd[30901]: transport socket = 12
Oct 15 15:00:54 sasquatch syslogd 1.4.1: restart.
Oct 15 15:00:54 sasquatch syslog: syslogd startup succeeded
Oct 15 15:00:54 sasquatch kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 15 15:00:54 sasquatch kernel: Bootdata ok (command line is ro root=/dev/md1 console=tty0, console=ttyS0,9600n8)
Oct 15 15:00:54 sasquatch kernel: Linux version 2.6.9-55.0.2.ELsmp (address@hidden) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26 14:14:47 EDT 2007
Oct 15 15:00:54 sasquatch kernel: BIOS-provided physical RAM map:

After the machine came back up, the kernel began reporting a 'general bus error':

Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: USB hub found
Oct 15 15:01:02 sasquatch kernel: hub 2-0:1.0: 3 ports detected
Oct 15 15:01:02 sasquatch kernel: md: Autodetecting RAID arrays.
Oct 15 15:01:02 sasquatch kernel: md: autorun ...
Oct 15 15:01:02 sasquatch kernel: md: ... autorun DONE.
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
Oct 15 15:01:02 sasquatch kernel: MC1: CE page 0x37ab4c, offset 0x438, grain 8, syndrome 0x6d, row 3, channel 1, label "": k8_edac
Oct 15 15:01:02 sasquatch kernel: MC1: CE - no information available: k8_edac Error Overflow set
Oct 15 15:01:02 sasquatch kernel: EDAC k8 MC0: extended error code: ECC error

We think (but are not positive) that the messages being written to
/var/log/messages indicate that one bank of memory is bad.
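
One way to test the bad-bank idea is to tally the corrected-error (CE) reports by memory controller, row, and channel; if nearly all of them land on one row/channel pair, that points at a single DIMM. The following is only a sketch in Python. It assumes that each record occupies one line in the log file and that the k8_edac message format matches the excerpt above; the function name tally_ce_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches k8_edac corrected-error lines shaped like the one in the excerpt:
#   "... kernel: MC1: CE page 0x37ab4c, offset 0x438, grain 8,
#    syndrome 0x6d, row 3, channel 1, ..."
# Assumption: the format is the same for every CE line in the log.
CE_PATTERN = re.compile(
    r"kernel: MC(\d+): CE page 0x[0-9a-f]+,.*?row (\d+), channel (\d+)"
)

def tally_ce_errors(lines):
    """Count CE reports per (controller, row, channel) tuple."""
    counts = Counter()
    for line in lines:
        match = CE_PATTERN.search(line)
        if match:
            mc, row, channel = (int(group) for group in match.groups())
            counts[(mc, row, channel)] += 1
    return counts

# Hypothetical usage on the live log (needs read access to the file):
#   with open("/var/log/messages", errors="replace") as log:
#       for key, n in tally_ce_errors(log).most_common():
#           print(key, n)
```

Lines of the "CE - no information available" kind carry no row or channel and are deliberately skipped, so the tally undercounts whenever the error overflow flag is set.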

That this situation has been ongoing since the 15th can be deduced from the
sizes of the /var/log/messages files:

[root@sasquatch ~]# ls -alt /var/log/messages*
-rw-------  1 root root 281057310 Oct 30 16:33 /var/log/messages
-rw-------  1 root root 215566774 Oct 22 04:03 /var/log/messages.1
-rw-------  1 root root   1151633 Oct 15 04:02 /var/log/messages.2
-rw-------  1 root root   1196676 Oct 14 04:02 /var/log/messages.3
-rw-------  1 root root   1186767 Oct 13 04:02 /var/log/messages.4
-rw-------  1 root root   1164272 Oct 12 04:02 /var/log/messages.5
-rw-------  1 root root   1161643 Oct 11 04:02 /var/log/messages.6
-rw-------  1 root root   1140437 Oct 10 04:02 /var/log/messages.7
-rw-------  1 root root   1021564 Oct  9 04:02 /var/log/messages.8
-rw-------  1 root root   1163906 Oct  8 04:02 /var/log/messages.9
-rw-------  1 root root   1184374 Oct  7 04:02 /var/log/messages.10
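
The jump is easy to see by eye, but the same check can be sketched in Python by comparing each file against the median size. The sizes are copied from the listing above; the function name flag_oversized and the factor of 10 are arbitrary choices for illustration, not anything the LDM or logrotate provides.

```python
from statistics import median

def flag_oversized(sizes, factor=10):
    """Return the names whose size exceeds `factor` times the median size."""
    typical = median(sizes.values())
    return [name for name, size in sizes.items() if size > factor * typical]

# Byte counts taken from the 'ls -alt /var/log/messages*' output above.
sizes = {
    "messages": 281057310,
    "messages.1": 215566774,
    "messages.2": 1151633,
    "messages.3": 1196676,
    "messages.4": 1186767,
    "messages.5": 1164272,
    "messages.6": 1161643,
    "messages.7": 1140437,
    "messages.8": 1021564,
    "messages.9": 1163906,
    "messages.10": 1184374,
}

print(flag_oversized(sizes))
```

Only the two files written after the October 15 reboot are flagged; every earlier weekly file is close to 1 MB.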

The end result of the errors being logged in /var/log/messages is that
the delivery of EXP data products to machines downstream of sasquatch
is showing very high latencies.

Aside:  the reason that the latency for weather.renci.org was capped
at 3600 seconds is that this is the default maximum latency for the LDM.
I restarted the LDM on weather.renci.org specifying '-m 10000' as the
maximum allowable latency, and the latencies shown for the EXP feed have
been steadily climbing towards 10000 seconds:

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?EXP+weather.renci.org

What to do?  I would say that it is time to take sasquatch down and try
to figure out if a bad memory DIMM really is the problem.  I would first
reseat all DIMMs; if that doesn't fix the problem, I would then try to
figure out which DIMM(s) is(are) bad.
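
As a possible aid in narrowing down the DIMM(s), and not something tried on sasquatch, the Linux EDAC driver can expose per-row corrected-error counters through sysfs. The Python sketch below assumes the legacy layout /sys/devices/system/edac/mc/mcN/csrowM/ce_count is present on the installed kernel, which should be checked first; the function name read_ce_counts is made up for illustration.

```python
from pathlib import Path

def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mcN/csrowM": corrected_error_count} for each chip-select row.

    Assumption: the kernel exposes the legacy EDAC sysfs tree under
    `edac_root`. If it does not, the result is simply an empty dict.
    """
    counts = {}
    for path in sorted(Path(edac_root).glob("mc*/csrow*/ce_count")):
        key = f"{path.parent.parent.name}/{path.parent.name}"
        counts[key] = int(path.read_text().strip())
    return counts

# Hypothetical usage:
#   for row, n in read_ce_counts().items():
#       print(row, n)
```

A row whose count keeps rising while the others stay at zero would be the first one to swap or remove after reseating.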

Cheers,

Tom
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: IIX-893121
Department: Support LDM
Priority: Normal
Status: Closed
